
tschatzl @ github | Thomas Schatzl

Thomas Schatzl, member of the Oracle Java HotSpot GC team, with many years of experience in Java GC.

JDK 25 G1/Parallel/Serial GC Changes

JDK 25 RC 1 has been released, and as usual this post gives a brief overview of the changes to the OpenJDK stop-the-world garbage collectors in this release. The full list of changes for the entire HotSpot GC subcomponent in JDK 25 is here, listing around 200 changes that were resolved or closed, in line with recent trends.

Let's start with a detailed look at the important JDK 25 changes:

Parallel GC

Probably the most important change for Parallel (and Serial) GC is JDK-8192647. If your application uses JNI extensively, you may have seen the VM shut down unexpectedly after warnings like "Retried waiting for GCLocker too often allocating XXX words". Starting with this release, neither collector exhibits this problem any more. The fix makes sure that a thread that could not perform a garbage collection because it was blocked by JNI will always trigger a garbage collection of its own before giving up.

Serial GC

Serial GC received the same GCLocker fix as Parallel GC. In addition, JDK-8346920 fixes Serial GC to avoid full GCs in some situations.

G1 GC

JDK-8343782 allows G1 to merge the remembered sets of arbitrary old generation regions with those of other regions, saving additional memory over the previous release. The pull request for this change contains graphs; the figure below, taken from there, shows a large reduction in remembered set memory usage for a benchmark that uses remembered sets heavily.

The red line shows the native memory usage of the remembered sets for that benchmark on a 64 GB Java heap, the blue line the same with the patch applied. The native memory consumption of that part of the VM drops from a peak of 2 GB to a peak of roughly 0.75 GB.

G1 now takes extra measures to avoid pause time spikes at the end of the space reclamation phase. These spikes were caused by suboptimal selection of candidate regions due to insufficient information.

The current process is roughly as follows: during the Remark pause, after the whole-heap liveness analysis completes, G1 determines the set of candidate regions to collect later. It tries to select the most efficient regions, i.e. regions that yield a lot of memory at low reclamation cost. Then, for these candidate regions, between the Remark and Cleanup pauses, G1 recomputes the remembered sets, the data structures that contain all locations of references into the live objects of these regions. The remembered sets are needed to evacuate objects in later garbage collection pauses, because they contain the locations of the references to the objects to be evacuated; after the objects move, these locations must be updated to point to the new object locations.

Determining the candidate regions is exactly the source of the long pause times: evacuation efficiency is a weighted combination of the amount of live data in a region and the size of the region's remembered set; G1 prefers to evacuate regions with little live data and small remembered sets, because these are fastest to evacuate. However, at the time the candidate regions are selected, G1 has no information about the remembered sets; in fact, G1 is just about to rebuild this information right after the selection. So far, G1 relied on liveness alone, assuming that remembered set size is proportional to the amount of live data in a region.

Unfortunately, this is not true in many cases. Most of the time it does not matter: evacuation is fast enough and the pause time goal lenient enough, but you may have noticed that later mixed collections tend to be slower than earlier ones. In the worst case, this behavior causes a large pause time spike in the last mixed collection of the space reclamation phase, because at that point only the most expensive regions are left to evacuate.

One way to solve this problem would be to simply skip evacuating these regions. The problem with that is that the initial candidate selection is based on freeing enough space during the space reclamation phase to keep running without a disruptive whole-heap compaction. Continuously excluding these regions from space reclamation could therefore lead to whole-heap compaction.

The change in JDK-8351405 improves the selection: while there is no exact measure of remembered set size at that point, is there a proxy that can be computed quickly? There actually is: during marking, G1 now counts, for every region, the number of incoming references it encounters (it only counts them, it does not store them). This is an excellent proxy, because the remembered sets store these locations at a coarser granularity, and after all the amount of work to be done in the garbage collection pause is proportional to the number of references.

G1 uses this more accurate efficiency information during candidate selection in the Remark pause, avoiding these pause time spikes.

The figure below, taken from the pull request, shows the impact of this change on pause times:

The graph shows the mixed GC pause times for JDK 25 (brown), JDK 24 (purple) and JDK 25 with this change (green), with the default pause time goal of 200ms. The first two lines show that the last mixed GC of the space reclamation phase always takes very long (that the spikes occur last can be inferred from the longer gaps between mixed garbage collections - there are young-only garbage collection pauses in between that are not shown). As you can see, this technique completely removes these spikes of up to 400ms.

The write barrier improvements originally planned for JDK 25 (explained in this post) did not make it in time due to process issues. Hopefully they will land in JDK 26.

Some fixes (here and here) avoid superfluous garbage collection pauses in certain edge cases. Another change improves the performance of particular garbage collections.

Update 2025-09-03: There is a regression where G1 cannot obtain large pages via Transparent Huge Pages (THP), which we only found after the JDK 25 release. This may cause performance regressions for all applications using THP configured as madvise. The problem will be fixed in a later update release.

What's next

A lot of the current effort in G1 goes into implementing automatic heap sizing for G1: by tracking the free memory in the environment and sizing the heap accordingly, G1 can automatically determine a maximum Java heap size that is efficient and stays within the available memory budget. Some of the changes required for automatic heap sizing (like the fixes avoiding superfluous garbage collection pauses mentioned above, as well as here or here) have already been merged into JDK 26, and several foundational changes are under review (for example this one and that one).

Another JEP we plan to complete in this release cycle makes ergonomics always select G1 as the default collector, to remove further confusion about this particular topic :)

Thanks to...

... everyone who contributed to yet another great JDK release, as always. See you in the next (even better) release.

Thomas
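To make the candidate-selection problem from the JDK 25 notes above concrete, here is a toy Python sketch comparing a liveness-only ordering of candidate regions with one that uses the counted incoming references as a remembered-set proxy. The region size, per-reference cost and all numbers are invented for illustration; this is not G1's actual cost model.

```python
# Hypothetical illustration of candidate-region ordering. The cost model
# (live bytes plus a per-reference cost) is a stand-in for G1's internal
# heuristics, not the real implementation.

def efficiency(live_bytes, incoming_refs, ref_cost=8):
    # Reclaimable bytes per unit of evacuation cost; higher is better.
    region_size = 1 << 20  # assume 1 MiB regions for the sketch
    reclaimable = region_size - live_bytes
    cost = live_bytes + incoming_refs * ref_cost
    return reclaimable / cost

# (live_bytes, incoming_refs) for three hypothetical old regions.
regions = {"A": (100_000, 50), "B": (120_000, 200_000), "C": (300_000, 10)}

# Liveness-only ordering assumes references are proportional to live data ...
by_liveness = sorted(regions, key=lambda r: regions[r][0])
# ... while the proxy-based ordering also accounts for counted references.
by_proxy = sorted(regions, key=lambda r: efficiency(*regions[r]), reverse=True)

print(by_liveness)  # region B looks cheap if only liveness is considered
print(by_proxy)     # with the reference count, B sorts last
```

Region B, nearly as small as A by liveness but with many incoming references, drops to the end of the evacuation order once the reference count is taken into account, which is the effect the JDK-8351405 change is after.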

2025/8/12

JDK 24 G1/Parallel/Serial GC Changes

JDK 24 was released a few weeks ago. This post gives a brief overview of the improvements to the OpenJDK stop-the-world garbage collectors in that release. At first I thought there was not much worth discussing, which is why I delayed this post, but it turns out I was wrong.

Similar to the previous release, JDK 24 is fairly muted in the GC area, but there is a lot to look forward to, particularly the JDK 25 improvements mentioned at the end :) The full list of changes for the HotSpot GC subcomponent in JDK 24 is here, with around 190 changes resolved or closed. Nothing particularly out of line overall.

Also, JavaOne happened again.

Parallel GC

JDK-8269870 removed unnecessary synchronization in the (usually very hot) reclamation loop. This may reduce pause times in some scenarios.

Serial GC

Cleanup and refactoring of the Serial GC code continues.

G1 GC

G1 meets its pause time goal by predicting several dimensions of a garbage collection; these predictions depend both on application characteristics and on the environment the VM runs in. Typical predictions include the actual cost of copying memory, the number of references to update, and so on. G1 retrains the predictors on every VM run. Early in a run, lacking sample data, G1 uses preset estimates. Ideally these presets would be regularly maintained and tuned for every environment, but as you can imagine that is an impossible task: the presets were actually determined a long time ago on some "big" SPARC machine, and then... forgotten.

Deficiencies in these values have real consequences: almost every machine running a JVM today is faster than that reference machine, so the presets are too conservative (e.g. the expected costs are much higher than the actual ones). As a result, for the first few garbage collections G1's pauses are unnecessarily short, until the newly measured values eventually override the presets. Measurements show that it takes roughly 30 garbage collections until G1 adapts to the current environment.

The change in JDK-8343189 modifies how the predictors incorporate new measurements: the first measured value now replaces the preset outright instead of being accumulated on top of it. This means that, apart from the very first garbage collection, predictions are based on actual samples. This drastically speeds up G1's adaptation to the application and environment.

There is one small caveat: the advantage of the old conservative estimates was that they more or less guaranteed that G1 would not exceed the pause time goal during the first few collections, since they strongly dampened the predictor output. With the new mechanism there may be more initial overshoot of the pause time goal in the first few collections.

We also continue working on reducing native memory usage, this time focusing on the remembered sets. As described in the respective section of the garbage collection tuning guide, G1 now manages the young generation's remembered set as a single unit, saving memory through deduplication. Fewer young generation remembered set entries also (slightly) shortens garbage collection pauses. The JDK 23 blog post explains this idea with additional graphs at the end.

What's next

The same section of the previous release's post explained in detail how a single remembered set for multiple regions could be applied to arbitrary sets of regions, and its impact. Building on the merged young generation remembered sets, JDK-8343782, already integrated into JDK 25, extends that idea to old generation regions, enabling even larger native memory savings. That change will merge the remembered sets of old generation region groups by default.

Planned for JDK 25 is possibly the biggest change ever to how the application interacts with the G1 garbage collector: the write barriers - the short code sequences in the VM executed for every reference write to synchronize with the garbage collector, which have a substantial impact on application throughput (particularly so in G1) - have been completely redesigned. Drastically slimming down the write barriers promises significant performance improvements. We are preparing a JEP to integrate this change in the coming months. For more information see the implementation CR, the corresponding PR, and the detailed post about this topic on this blog.

Overall, this change can give up to 10% of throughput for free, shorter pauses, better code generation (and code size savings). There is a small regression in native memory usage which we think is acceptable, as the figure below illustrates:

The figure shows the native memory usage of the G1 collector across recent JDK releases, running the BigRAMTester benchmark on a 20GB Java heap (note: JDK 8 uses 8GB of native memory). From JDK 17 to JDK 21 the baseline memory usage initially dropped from 1.2GB to about 800MB due to the changes described here. The JDK 24 line (orange) shows the impact of using a single remembered set for the whole young generation (and JDK-8225409).

Finally, JDK 25 (the red, dashed "latest" line) adds JDK-8343782 (merging old generation remembered sets) and the write barrier changes that reduce memory usage further. So where is the regression - the JDK 25 line clearly looks best? The catch is that these changes require an additional, larger static memory block than before (fixed at about 0.2% of Java heap size); its impact shows at the very left of the graph. In this benchmark the remembered set savings more than offset the memory for the new data structure, but that is not generally the case.

This completes the current milestone. Apart from final integration into a release via the JEP process, more work is ongoing in this area (as this change was fairly conservative). There is reason to believe throughput will improve further.

In addition, some potentially important topics are being discussed on the hotspot-gc-dev mailing list:

Longer term, there is a project working on adding ZGC-like automatic heap sizing to the stop-the-world collectors. First contributions have started to arrive; thanks to Microsoft and Google for contributing to this work!

There is a public discussion about making G1 the "real" default collector: currently, if no garbage collector is specified on the command line, the VM's ergonomics may select the Serial garbage collector instead of G1 on machines with small heaps and few CPU resources. More and more measurements indicate that out of the box (i.e. without additional tuning) G1 matches or beats the Serial collector on most or all metrics.

If you have data on the latter topic, please join the discussion.

Thanks to...

... everyone who contributed to another great JDK release, as usual. See you in the next (even better) release.

Thomas
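The predictor-seeding idea from JDK-8343189 above can be sketched with a toy decaying-average predictor. The decay factor, the preset value and the class itself are invented for illustration; G1's real predictors are sequence-based and use different constants.

```python
# Simplified decaying-average predictor; alpha and the preset value are
# illustrative only, not G1's real constants.

class Predictor:
    def __init__(self, preset, alpha=0.3, seed_with_first_sample=True):
        self.value = preset
        self.alpha = alpha
        self.seed = seed_with_first_sample
        self.samples = 0

    def add(self, sample):
        if self.seed and self.samples == 0:
            self.value = sample  # new behavior: first sample replaces preset
        else:
            self.value += self.alpha * (sample - self.value)  # blend in
        self.samples += 1
        return self.value

# Preset from an old "big" machine: 10 ms; real cost on this machine: 1 ms.
old = Predictor(preset=10.0, seed_with_first_sample=False)
new = Predictor(preset=10.0, seed_with_first_sample=True)
for _ in range(3):
    old.add(1.0)
    new.add(1.0)

print(round(old.value, 3))  # still far above the measured 1 ms
print(round(new.value, 3))  # converged immediately
```

With blending, the stale preset still dominates after several samples, mirroring the roughly 30 collections G1 previously needed to adapt; replacing the preset with the first real measurement removes that warm-up almost entirely.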

2025/4/1

A New Write Barrier for G1

The Garbage First (G1) collector's throughput sometimes trails that of other HotSpot VM collectors - by up to 20% (e.g. JDK-8253230 or JDK-8132937). The difference is caused by G1's principle to be a garbage collector that balances latency and throughput and tries to meet a pause time goal. A large part of that can be attributed to the synchronization of the garbage collector with the application necessary for correct operation. With JDK-8340827 we substantially redesign how this synchronization works for much less impact on throughput. This post explains these fairly fundamental changes. Other sources with more information are the corresponding draft JEP and the implementation PR.

Background

G1 is an incremental garbage collector. During garbage collection, a large part of the time can be spent on trying to find references to live objects in the areas of the application heap (just "heap" in the following) to evacuate them. The simplest and probably slowest way would be to look through (scan) the entire heap that is not going to be evacuated for such references. The stop-the-world HotSpot collectors Serial, Parallel and G1 in principle employ a fairly old technique called Card Marking [Hölzle93] to limit the area to scan for references during the garbage collection pause.

In an attempt to better keep pause time goals, G1 extends this card marking mechanism: concurrently to the application, G1 re-examines (refines) the cards marked by the application and classifies them. This classification means that the garbage collection pause only needs to scan the cards important for that particular garbage collection. Extra code compiled into the application (write barriers) removes unnecessary card marks, reducing the number of cards to scan further. This comes at additional cost, as the next sections will show.

Card Marking in G1

Card Marking divides the heap into fixed-size chunks called cards. Every card is represented as a byte in a separate fixed-size array called the card table. Each entry corresponds to a small area of the heap, typically 512 bytes. A card may either be marked or unmarked, a mark indicating the potential presence of interesting references in the heap area corresponding to the card. An interesting reference for garbage collection is a reference from an area of the heap that is not going to be garbage collected to one that is.

When the application modifies an object reference, additional code compiled into the application, the write barrier, intercepts the modification and marks the corresponding card in the card table.

Figure 1 above shows an example execution of a hypothetical assignment of the field a in an object x of type X with a value of y. After writing the value into the field, the write barrier code (to be exact, the post-write barrier code, i.e. code added after setting the value) marks the card.

This is where the Serial and Parallel garbage collectors stop: they let the application accumulate card marks until garbage collection occurs. At that time all of the heap corresponding to the marked cards is scanned for references into the evacuated area. In most applications this is okay effort-wise: the number of unique cards that need to be scanned during garbage collection is very limited. However, in other applications, scanning the heap corresponding to the cards (scanning the cards) can take a very significant amount of total garbage collection time.

G1 tries to reduce this amount of card scanning in the garbage collection pause by several means. The first is using extra garbage collection threads running concurrently to the application, clearing, re-examining and classifying card marks, because:

- references are often written over and over again between garbage collections. A card mark caused by a reference write may, by the time the next garbage collection occurs, not contain any interesting reference anymore.
- by classifying card marks according to where they originate from, it is possible to only scan marked cards that are relevant for this particular garbage collection.

Figures 2, 3 and 4 give details about this re-examination (refinement) process. First, Figure 2 shows that in addition to the actual card mark mentioned above, the write barrier stores (enqueues) the card location (in this case 0xabc) in an internal buffer (refinement buffer), shown in green, so that the re-examining garbage collector threads (refinement threads) can later easily find them again.

There is an artificial delay between card mark and refinement, based on the time available in the pause for scanning cards and the rate at which the application generates new marked cards. This delay helps decrease the overhead of an application repeatedly marking the same cards, avoiding that the same cards are repeatedly enqueued for refinement (as they are already marked). The delay also increases the probability that the references in the card itself are not interesting anymore.

In the next step, shown in Figure 3, refinement threads pick up previously enqueued cards for re-examination. In this example, the card at location 0xdef will be refined. The figure also shows the remembered sets in light blue: for every area to evacuate, G1 stores the set of interesting card locations for this area. In this figure every area has such a remembered set attached to it, but for some there may not be one currently. Areas may also be discontiguous in the heap (JDK-8343782 and JDK-8336086). The refinement thread also unmarks the card before looking at the corresponding contents of the heap.

Finally, Figure 4 shows the step where the refinement threads put the examined card (0xdef) into the remembered sets of the areas for which that card contains an interesting reference at this point in time. Since the heap covered by a card may contain multiple interesting references, multiple remembered sets may receive that particular card location.

The end result of this fairly complicated process is that, compared to the other throughput collectors in the HotSpot VM, in the garbage collection pause G1 only needs to scan cards from two sources:

- cards just recently marked dirty by the application and not yet refined.
- cards from the remembered sets of areas that are about to be collected.

G1 merges these two sources, marking cards from the remembered sets on the card table, before scanning the card table for card marks [JDK-8213108]. Depending on the application this can result in a significant reduction of time spent in card scanning during garbage collection compared to regular card marking.

Write Barrier in G1

Barriers are small pieces of code that are executed to coordinate between the application and the VM. Garbage collection uses them extensively to intercept changes in memory. The Serial, Parallel and G1 garbage collectors use write barriers on reference writes, i.e. every time the application writes a value into a reference, the VM executes additional code in the vicinity of that write.

The figure above visualizes this: for a write of the value y into a field x.a, next to the actual write, code unrelated to the write, used for synchronization with the garbage collector, is executed. Serial and Parallel GC only have a post-write barrier (after the write), while G1 has both pre- and post-write barriers (before and after the write). G1 uses the pre-write barrier for matters unrelated to this discussion, so the following text will just use "write barrier" or just "barrier".
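As a toy model of the card marking described above (the 512-byte card size is from the text; the heap base, heap size and addresses are invented), a Serial/Parallel-style post-write barrier simply dirties the card table byte covering the written field:

```python
# Toy model of card marking: 512-byte cards, one byte per card.
# Heap base/size and the written addresses are invented for the example.
CARD_SHIFT = 9              # 2^9 = 512 bytes per card
HEAP_BASE = 0x10000
HEAP_SIZE = 1 << 20         # 1 MiB toy heap

card_table = bytearray(HEAP_SIZE >> CARD_SHIFT)
CLEAN, DIRTY = 0, 1

def card_index(field_addr):
    return (field_addr - HEAP_BASE) >> CARD_SHIFT

def post_write_barrier(field_addr):
    # Serial/Parallel-style post-write barrier: unconditionally dirty the
    # card covering the written reference field.
    card_table[card_index(field_addr)] = DIRTY

# Two reference writes close together land on the same card.
post_write_barrier(HEAP_BASE + 0x1000)
post_write_barrier(HEAP_BASE + 0x1010)

dirty = [i for i, c in enumerate(card_table) if c == DIRTY]
print(dirty)  # [8] -- a single card covers both writes
```

This also shows why repeated writes to nearby fields are cheap for the collector: however often the application writes within those 512 bytes, only one card ends up marked for scanning.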
The previous section hinted at the responsibilities of the write barrier in G1:

- mark the card as dirty
- store the card location in refinement buffers if the card has not been marked yet

This sounds straightforward, but unfortunately there is a complication that Figure 5 shows: both the application and the refinement threads might access the same card at the same time, the former marking it, the latter clearing it. Without additional precautions this might result in lost updates to the remembered sets, where the refinement thread observes the card mark but not the write of the value. In fact, this requires some fairly costly memory synchronization in the write barrier.

Additionally, concurrent refinement of cards can be expensive, particularly if there are no extra processing resources available. So the G1 barrier contains extra filtering code to avoid card marks that are generated by reference writes that make no difference for garbage collection. The G1 write barrier will not mark a card if

- the reference assignment writes a reference that is not interesting, i.e. not crossing areas.
- the code assigns a null value: these do not generate links between objects, so the corresponding marked cards are unnecessary.
- the card is already marked, which means that it is already scheduled for refinement.

Figure 6 compares the sizes of the resulting write barriers: the directly inlined part of the G1 write barrier (there is an additional part, not shown here, which is executed somewhat rarely) on the left, and the whole Serial and Parallel GC write barrier on the right ([Protopopovs23]). Without going into detail, instead of three (x86-64) instructions, the G1 write barrier takes around 50 instructions. The bars to the left of the G1 write barrier roughly correspond to the tasks above: blue for the filters, green for the memory synchronization and the actual card mark, and orange for storing the card in the refinement buffers for the refinement threads.

Impact

G1 uses a large write barrier with many mechanisms to minimize the performance impact of the memory synchronization necessary for correctness and the overhead of the concurrent work. All the effort to reduce memory synchronization is wasted if the application's memory access pattern does not fit: applications that perform lots of fairly random reference assignments across the whole heap, which to a large part result in enqueuing many card locations for later refinement, like BigRAMTester [JDK-8152438]. The other poor fit for this write barrier are applications that execute the write barrier in tight loops and at the same time extremely rarely generate a card mark with interesting references, so that no concurrent refinement is necessary or actually ever performed (e.g. JDK-8253230). These are hampered by the large code footprint of the barrier, inhibiting compiler optimizations, or just execute slowly due to unnecessary allocation of CPU resources.

A New Approach

The previous sections showed that a large part of the G1 write barrier is required for the per-card-mark memory synchronization that guarantees correctness in the case of concurrent writes to the card table, and for storing the card locations for subsequent refinement. To remove the memory synchronization, similar to ZGC's double-buffered remembered sets [JEP 439], this new approach uses two card tables. Each set of threads, the application threads and the refinement threads, is assigned its own card table, each initially completely unmarked. Each set of threads only ever writes different values to its own card table (just "card table" and "refinement table" respectively in the following), obviating the need for fine-grained memory synchronization during the actual card mark.
Similar to before, a heuristic tracks the card mark rate on the application card table, and if it predicts that there will be too many card marks on that card table to meet the pause time goal in the next garbage collection, G1 atomically swaps the card tables for the application. The application then continues to mark its card table (previously the refinement table), while garbage collection refinement threads work to re-examine all marks on the refinement table (the previous application card table). This removes the need for memory synchronization in the G1 write barrier.

Additionally, the refinement table is directly used as storage for the to-be-refined marked cards, given that finding the marked cards is not prohibitively expensive. This completely removes the need for the write barrier code to store the card locations in refinement buffers, as well as the memory synchronization.

New Card Marking in G1

Figure 7 shows this new arrangement of data structures: there are now two card tables, and the write barrier marks the card on the (application) card table as before. When the card table has accrued enough marks, G1 switches the card tables atomically. Figure 8 shows an example where the application card table and the refinement table have just been switched, and the application has already continued marking cards on the former refinement table. Refinement threads start re-examining cards from the refinement table (previously the application card table) as described before. Card marks are cleared until all marked cards have been processed. Since the application threads continue to mark cards on their own card table, no synchronization between application threads and refinement threads is required.

New Write Barrier in G1

At minimum, the new G1 write barrier requires the card mark, similar to Parallel GC. The difference is that unlike in the Parallel GC write barrier, the card table base address is not constant: it changes at every card table switch. So while Parallel GC can inline the card table address into the code stream, G1 needs to reload it every time from thread-local storage.

The final current post-write barrier for G1 reduces to the filters and the actual card mark. For a given assignment x.a = y, the VM now adds the following pseudo-code after the assignment:

    (1) if (region(x.a) == region(y)) goto done; // Ignore references within the same region/area
    (2) if (y == nullptr) goto done;             // Ignore null writes
    (3) if (card(x.a) != Clean) goto done;       // Ignore if the card is non-clean
    (4)
    (5) *card(x.a) = Dirty;                      // Mark the card
    (6) done:

Lines (1) to (3) implement the filters. They are almost the same as before, with a slightly different condition for the last filter due to an optimization. Without the filters there were some regressions compared to the original write barrier with the filters; the filters also decrease the number of cards that have not been scanned during the garbage collection pause, and the number of cards to be re-examined, so they were kept for now. Finally, line (5) actually marks the card with a "Dirty" color value.

The original card marking paper uses two different values for card table entries, i.e. colors: "Marked" and "Unmarked", typically named "Dirty" and "Clean" in the HotSpot VM. In total G1 uses five colors, which store some additional information about the corresponding heap area:

- clean - the card does not contain an interesting reference.
- dirty - the card may contain an interesting reference.
- already-scanned - used during garbage collection to indicate that this card has already been scanned. Necessary for the potentially staged approach of garbage collection to better keep the pause time.
- to-collection-set - the card may contain an interesting reference to the areas of the heap that are going to be collected in the next garbage collection (the collection set, hence the name). The collection set always contains the young generation because it is collected at every garbage collection. Refinement can skip scanning these cards because they will always be scanned later during garbage collection, as G1 always collects the young generation. Adding these cards to remembered sets, even if they contained references to regions not in the collection set, is not needed either; they would actually represent duplicate information. Effectively, the set of to-collection-set cards also stores the entire remembered set for the next collection set on the card table, avoiding extra memory. To keep the write barrier simple, it only colors cards as "Dirty": the additional overhead of determining whether a reference refers to an object in the collection set is too expensive there.
- from-remset - used during garbage collection to indicate that the origin of this card is a remembered set and not a recently marked card. This helps distinguish cards from remembered sets from not yet examined cards, to more accurately model the application's card marking rate used in the heuristics.

Only the last two card colors are new. The use of the to-collection-set color explains the condition used in line (3) of the write barrier above: it avoids the application overwriting this value unnecessarily, excluding benign races.

Switching the Card Tables and Refinement

The goal of the card table switching process is to make sure that all threads in the system agree that the refinement table is now the application card table and the other way around, to avoid the problematic situation described earlier in Figure 5. The process is initiated by a special background thread, the refinement control thread. It regularly estimates whether the number of cards expected at the start of the next garbage collection would exceed the allowed number of not re-examined cards, given the card examination rate.
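The refinement decisions implied by the card colors described above can be sketched in a few lines. This is a simplification of the rules in the text; the function, its arguments and the return convention are invented, not HotSpot code:

```python
# The five card values described in the text, as a toy refinement step.
# refine_card() returns (value to write to the application card table,
# whether to record the card in a remembered set).
CLEAN, DIRTY, ALREADY_SCANNED, TO_COLLECTION_SET, FROM_REMSET = range(5)

def refine_card(value, refers_to_collection_set):
    if value == TO_COLLECTION_SET:
        # Re-mark as such; skip examining the heap contents entirely.
        return TO_COLLECTION_SET, False
    if value == DIRTY and refers_to_collection_set:
        # Will be scanned during GC anyway; no remembered-set entry needed.
        return TO_COLLECTION_SET, False
    if value == DIRTY:
        # Interesting reference elsewhere: record it in a remembered set.
        return CLEAN, True
    return CLEAN, False

print(refine_card(DIRTY, False))  # (CLEAN, True): clean and record
```

The sketch ignores the cases of unparsable heap (forwarded as dirty) and the already-scanned/from-remset colors used only inside the pause, but it shows the key asymmetry: references into the collection set never create remembered set entries, they just recolor the card.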
If refinement work is necessary, it also calculates the number of refinement worker threads, which do the actual work, needed to complete before the garbage collection. This refinement round consists of the following phases:

1. Swap the references to the card tables. This includes the global card table pointers that newly created VM threads and VM runtime calls use, and every internal thread's local copy of the current reference to the application card table. This step uses thread-local handshakes (JEP 312) and a similar technique for VM-internal, garbage collector related, threads. This ensures correct memory visibility: after this step, the entire VM uses the previous refinement table to mark new cards, and all card marks and reference writes by the application at that point are synchronized.

2. Snapshot the heap to gather internal data about the refinement work. For every region of the heap the snapshot stores the current examination progress. This allows resuming the refinement at any time and facilitates parallel work.

3. Re-examine (sweep) the refinement table containing the marked cards. Refinement worker threads walk the card table linearly (hence sweeping), claiming parts of it to look for marked cards using the heap snapshot. As Figure 9 shows, the refinement threads re-examine marked cards, with the following potential results:

- Cards with references to the collection set are not added to any remembered set. The refinement thread marks these cards as to-collection-set in the application card table and skips any further refinement.
- If the card is already marked as to-collection-set, the refinement thread re-marks it as such on the application card table and does not examine the corresponding heap contents at all.
- Dirty cards which correspond to heap that cannot be examined right now are forwarded to the application card table as dirty.
- Interesting references in heap areas corresponding to dirty cards cause that card to be added to the remembered sets.

During card examination, the card is always set to clean on the refinement table. Parts of the refinement table corresponding to the collection set are simply cleaned without re-examination, as these cards are not necessary to keep: during evacuation of these regions, the references of live objects in this area will be found implicitly by the evacuation. Writing directly to the application card table by the refinement threads is safe: memory races with the application are benign, the result being at worst another dirty card mark without the additional to-collection-set color information.

4. Calculate statistics about the recent refinement and update the predictors.

Any of these steps may be interrupted by a safepoint, which may be a garbage collection pause that evacuates memory. If not, the refinement worker threads will continue their refinement work using the heap snapshot where they left off.

Garbage Collection and Refinement

The refinement heuristics try to avoid having a garbage collection interrupt refinement. In this case, the refinement table is completely unmarked at the start of the garbage collection, and all not-yet-examined marked cards are on the application card table, just where the following card table scan phase expects them. No further action needs to be taken to be able to search for marked cards on the card table efficiently, except putting the remembered sets of the areas to be collected onto the application card table.

Previously, G1 had information about the location of all marked cards: these locations were either recorded in the remembered sets, or in the refinement buffers of the cards to refine. Based on this, G1 could create a more detailed map of where marked cards were located, and only search those areas for marked cards instead of searching the whole card table. However, searching for marked cards is linear access to a relatively small area of memory, so very fast. The absence of more precise location information for marked cards is also offset by not needing to calculate this information. In the common case of garbage collections with only young generation heap areas to evacuate, there is nothing to do in this step, as the young generation areas' remembered set is effectively tracked on the card table.

If a young garbage collection pause occurs at any point during the refinement process, the garbage collection needs to perform some compensating work for the not yet swept parts of the refinement table. In this case the G1 garbage collector executes a new Merge Refinement Table phase that performs a subset of the refinement process:

1. (Optionally) Snapshot the heap as above, if the refinement had been interrupted in phase 1 of the process.

2. Merge the refinement table into the card table. This step combines the card marks from both card tables into the application card table, a logical or of both card tables. All marks on the refinement table are removed.

3. Calculate statistics as above.

The reason why the refinement table needs to be completely unmarked at the start of the garbage collection is that G1 uses it to collect card marks containing interesting references for objects evacuated during the garbage collection, in the heap areas the objects are evacuated to. This is similar to the extra refinement buffers previously used to store those. At the end of the young garbage collection, the two card tables are swapped, so that all newly generated cards are on the application card table and the refinement table is completely unmarked. A full collection clears both card tables, as this type of garbage collection does not need this information.

Performance Impact

This section contains some information about throughput, (native) memory footprint and latency compared to before these changes.
Throughput

In summary, throughput improves significantly for applications that often execute the write barrier in full, because of the removal of the fine-grained synchronization in the barrier. Other aspects:

- The new write barrier also significantly reduces refinement overhead, because finding marked cards is faster (linear search vs. random access). There are also savings due to reduced refinement work from keeping the young generation remembered set on the card table, as these cards are re-examined at most once.
- The reduced CPU resource cost (size) of the write barrier enables better performance, either due to better optimization in the compiler or due to being easier to execute.

In G1, throughput improvements may also manifest as reductions in used heap size. The extent of the improvements can differ by microarchitecture. Parallel GC is still slightly ahead; there is more work going on to reduce this difference further.

Native Memory Footprint

The additional card table takes 0.2% of Java heap size compared to JDK 21 and above. In JDK 21 one card-table-sized data structure was removed, so JDKs before that do not show that difference. Some of that additional memory is offset by the removal of the refinement buffers for card locations. Additionally, memory usage is reduced by keeping the collection set's remembered set on the card table, taking no extra space. The optimization of coloring these remembered set entries specially keeps duplicates from appearing in the other remembered sets, also reducing native memory.

In some applications these memory reductions completely offset the additional card table memory usage, but this is fairly rare. Particularly applications that did not have large remembered sets for the young generation, which are mostly very throughput-oriented applications, show the additional memory usage mentioned above. The refinement table is only required if the application needs to do any refinement. So the refinement table could be allocated lazily, i.e. only if there is some refinement. There is a large overlap between such applications and the very throughput-oriented applications above. This is not implemented in the current version.

Latency, Pause Times

Pause times are not affected, if anything tending to be slightly shorter. Pause times typically decrease due to a shorter "Merge remembered sets" phase, because in the common case there is no work to do for the remembered sets of the young generation - they are always already on the card table. Even if the refinement table needs to be merged into the card table, that is extremely fast and, in my measurements, always faster than merging remembered sets for the young generation. This work linearly scans some memory instead of performing many random writes, so it is embarrassingly parallel. The cards created during garbage collection do not need to be re-dirtied, so that is another garbage collection phase that was removed.

Code Size

During review a colleague mentioned that the change reduces total code size by around 5%.

Summary

This change removes the need for a large part of G1's write barrier by using a dual card table approach to avoid fine-grained synchronization, increasing throughput significantly for some applications. Overall, I'm quite satisfied with the change: after many years of thinking about and prototyping solutions to this problem without introducing some "G1 throughput mode" that would have had huge implications on maintainability (basically another garbage collector), or making G1 unnecessarily complex, this seems a very good solution, taking the advantages of these throughput barriers without too many drawbacks.

A lot of people helped with this change, my thanks.

Hth,
Thomas

2025/2/21

JDK 23 G1/Parallel/Serial GC Changes

With JDK 23 now in its second rampdown phase, here is the customary overview of the changes to the stop-the-world garbage collectors in OpenJDK. Compared to the previous release, JDK 23 is fairly modest in the GC area, but at the end of this post I will mention some interesting work in progress :) The full list of changes for the entire Hotspot GC subcomponent in JDK 23 is here; at the time of writing roughly 230 changes have been resolved or closed, nothing particularly unusual.

Parallel GC

The most significant recent change to Parallel GC is the replacement of its full-GC algorithm with a more traditional parallel mark-sweep-compact algorithm.

The old algorithm overwrote object headers during compaction, but still needed object sizes to recompute the final locations of moved objects when updating references. To recover sizes after the headers were overwritten, the Parallel GC full GC used a second bitmap in its first phase, the same size as the one recording live-object starts, to record live-object ends. Whenever the size of a live object was needed, the algorithm scanned forward in the bitmap from the object's start position to the next set bit and derived the size from the difference.

This is inefficient, and it had to be done every time the final location of a moved object was looked up. JDK-8320165 showed again that the time complexity of this operation is actually proportional to the square of the number of live objects. Several related and unrelated optimizations had been made over the years for this problem, but as JDK-8320165 demonstrates, they never fully solved it.

With JDK-8329203 we replaced Parallel GC's special-purpose algorithm with what is essentially G1's parallel full-GC algorithm, which does not have this problem. The second, end-of-object bitmap, which occupied 1.5% of Java heap size, was removed as well; measurements show overall performance is unchanged (and much better in the problematic cases).

JDK-8325553 reduced contention on the per-region live-bytes counters: instead of all threads immediately updating global counters, each thread now records region liveness locally while it only encounters live objects within the same set of regions.

Serial GC

Cleanup and refactoring of the Serial GC code continues.

G1 GC

JDK 23 resolves the long-standing issue JDK-8280087: during reference processing, G1 failed to grow its internal buffers on demand and instead bailed out early with an unhelpful error message.

Performance improvements such as JDK-8327452 and JDK-8331048 improve pause times and reduce the native memory footprint of the G1 collector.

All (STW) GCs

We previously introduced dedicated internal filler objects. The community pointed out that jdk.vm.internal.FillerArray is not a legal array class name, and some tools reported errors on these "variable-sized regular objects". With JDK-8319548, the class name of filler arrays was changed to the spec-conforming [Ljdk/internal/vm/FillerElement; and backported to JDK 21 and later.

What's next

Reducing G1's remembered-set storage requirements remains a focus: the earlier idea is to use a single remembered set for multiple regions, removing the need to store remembered-set entries between regions of the same group.

The figure below shows the current mechanism: regions (the large rounded boxes at the top, "Y" for young regions, "O" for old regions) have associated remembered sets (the small vertical boxes below the regions). Entries point to approximate locations outside the region that may contain references into it.

Young regions always have remembered sets, because moving their live objects requires finding the incoming references that need fixing up, and they are evacuated at every garbage collection. Note that remembered-set entries of different regions may point to the same locations in old regions.

As mentioned, young regions are always evacuated together, which means all remembered-set entries pointing to the same location are redundant. An initial change implements this idea: a single remembered set for all young regions automatically removes the redundant entries. This both saves memory and reduces the time spent filtering out those entries during garbage collection.

Technically, merging remembered sets also removes the need to store entries for cross-region references within a group; but young regions do not generate such entries, so they are not shown and do not contribute to the savings.

Applying this technique only to the young generation's remembered sets already yields substantial gains; the figure below shows measurements from a large application. The graph shows remembered-set memory usage before and after the change. Particularly noteworthy is that in the first 40 seconds, while only young regions have remembered sets, remembered-set memory usage is halved.

Once old regions start getting remembered sets, the improvement drops off because the prototype does not yet support merging old regions, but the savings from the young remembered sets remain significant.

This initial change merging the young regions' remembered sets is out for review, and a generalized version supporting grouping of old regions is in preparation.

The other focus of G1 work in this release cycle has been write-barrier changes: we have invested considerable effort in investigating how to close the throughput gap between G1 and Parallel GC in some scenarios without compromising G1's latency profile, and have found a good solution. An important step towards this goal is JEP 475: Late Barrier Expansion for G1, which will hopefully ship with JDK 24 and lays the groundwork for the planned optimizations.

There are still details to polish; resolving them takes time, and full delivery will take sustained effort. More to come over the next months…

Thanks go to…

As usual, thanks to everyone who contributed to another great JDK release. Looking forward to seeing you next release. Thomas
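To make the Parallel GC discussion above concrete, here is a minimal Java sketch (an illustrative model, not Hotspot code; all names are invented) of the old end-of-object bitmap lookup: with object headers overwritten, an object's size was recovered by scanning the end bitmap forward from the object's start word to the next set bit. Repeating this linear scan for every relocation lookup is what made the old algorithm so costly on heaps with many live objects.

```java
import java.util.BitSet;

public class EndBitmapSizeLookup {
    // One bit per heap word; a set bit marks the LAST word of a live object.
    // Hypothetical simplified model of the removed Parallel GC end bitmap.
    static int sizeInWords(BitSet endBits, int startWord) {
        // Linear forward scan for the next set bit -- done on every lookup,
        // which is why the total work could grow quadratically with the
        // number of live objects.
        int endWord = endBits.nextSetBit(startWord);
        if (endWord < 0) throw new IllegalStateException("corrupt bitmap");
        return endWord - startWord + 1;
    }

    public static void main(String[] args) {
        BitSet end = new BitSet();
        // Two live objects: words [0..2] and [3..10].
        end.set(2);
        end.set(10);
        System.out.println(sizeInWords(end, 0)); // 3
        System.out.println(sizeInWords(end, 3)); // 8
    }
}
```

G1's full GC, which Parallel GC now shares, avoids this by preserving object headers (and thus sizes) differently, so no end bitmap is needed at all.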

2024/7/22

JDK 22 G1/Parallel/Serial GC Changes

Another JDK release, another installment of this series: with the JDK 22 GA release imminent, here is an overview of the latest changes to OpenJDK's stop-the-world garbage collectors in this release. ;)

Overall, I think this release brings fairly substantial changes in the stop-the-world collector area, for example JEP 423: Region Pinning for G1. Beyond functional changes there are also some under-the-hood improvements that are interesting at least from a technical point of view, and young-generation collection performance improved for both Serial and Parallel GC. The full list of changes for the entire Hotspot GC subcomponent in JDK 22 is here; at the time of writing roughly 230 changes have been resolved or closed. This does not fully reflect the garbage collection changes: some important ones are attributed to other subsystems but have a significant impact on garbage collection pause times.

Here is the usual rundown of notable JDK 22 changes to Hotspot's stop-the-world collectors:

Parallel GC

In generational garbage collection, finding old-to-young references is an important task. Parallel (and Serial) GC use a card table for this. First, the mutator marks card table entries that may contain such references as "dirty". Then, during the pause, the algorithm scans the card table for these dirty marks and inspects the objects they indicate, because they may contain old-to-young references.

Parallel GC (as the name suggests) uses multiple parallel threads for scanning the typically large card table. The work distribution for card scanning splits the heap into 64 kB chunks. Any object whose card marks start in a chunk is processed by the thread owning that chunk.

JDK-8310031 implements two optimizations that result in better work distribution and performance:

The contents of large object arrays are now split across chunks, so the thread owning a large object array is no longer the only one looking for old-to-young references in it. Previously a single thread could end up walking a multi-GB object alone. There is an additional work-stealing mechanism after card scanning, but stealing is comparatively expensive compared with letting multiple threads examine parts of the object from the start.

That single owning thread also always walked all elements of the array, even though the dirty cards already indicated the interesting locations. So the processing thread would often wade through many references known to contain no old-to-young references at all. Now threads only process the parts of large object arrays that are actually marked dirty.

The old behavior could lead to inverse scaling with more threads and, compared with G1, very long pause times in some situations.

Another Parallel GC performance issue related to large array objects in the Java heap has also been fixed. Parallel GC now uses the same exponential back-skip in the block offset table (BOT) as the other collectors to find object starts, speeding up this lookup and improving overall pause times.

The block offset table solves the problem of finding the start of the object preceding (or crossing into) a given card in the card table. One use is the card scanning described above: the garbage collector needs to quickly find the start of a Java object, whether the object starts at that particular card or merely extends into it, to correctly decode the object (to find references).

For every card (typically representing 512 bytes of Java heap) there is one block offset table entry. The entry stores one of two kinds of information: either how many words before the heap address corresponding to this card the object extending into this card starts, or that there is no object start in the preceding card and the algorithm needs to consult the previous BOT entry for more information. The change in JDK-8321013 replaces this simple "look at the previous card" back-skip with a base-2 exponential encoding of the number of cards to skip back.

The figure above tries to illustrate this: the Java heap objects at the top show how Java objects might be laid out in the heap; the middle diagram shows the BOT and its entries, and where each entry points. Some point at object starts in the heap; the others, covering cards without an object start, point at the previous BOT entry (the back-skip value). The bottom diagram shows the back-skip values with the new BOT encoding (only back-skip references are shown), where a back-skip does not necessarily point at the previous BOT entry but at an entry much farther back within the object. This is particularly visible for the large rightmost Java object: walking from the object's end address to its start could take many steps, while with the new encoding the garbage collection algorithm needs only a few.

This significantly reduces the number of memory accesses when looking up the start of large objects, improving performance.

Serial GC

JDK-8319373 optimized Serial GC's card scanning code (the search for dirty cards) based on the new Parallel GC code added in JDK-8310031. If there are few dirty cards, this also substantially reduces young collection times.

A lot of work went into cleaning up the Serial GC code, removing dead code and abstraction layers shared with the Concurrent Mark Sweep collector that was removed in JDK 14 / JEP 363.

G1 GC

These are the user-facing changes for G1 in JDK 22:

G1 now (JDK-8140326) reclaims regions that failed evacuation at the next garbage collection of any type. This makes the G1 collector more resilient against the old generation being flooded with evacuation-failed regions.

The main use case is region pinning: attempting to evacuate a pinned region causes an evacuation failure, moving the affected region into the old generation. Without a mechanism to reclaim these typically quickly-reclaimable regions promptly after they are unpinned, they pile up in the old generation and, in the worst case, cause additional garbage collections or even unnecessary full GCs.

Obviously, this also helps reclaim the space of regions that failed evacuation because there was not enough memory to copy objects.

In addition, this change removes G1's previous (self-imposed) restriction that only particular types of young collections could reclaim space in old-generation regions. Now any young collection may evacuate old-generation regions if certain conditions are met.

The long journey towards removing the use of the GCLocker in G1 ended with the integration of JDK-8318706 and the completion of JEP 423: Region Pinning for G1.

In short: previously, no garbage collection could occur while an application accessed an array via the JNI Get/ReleasePrimitiveArrayCritical methods. This change modifies the garbage collection algorithms to keep such objects in place, "pinning" them and marking the corresponding regions, while still allowing evacuation of any other region, and of non-primitive-array objects even within pinned regions. The latter optimization is possible because Get/ReleasePrimitiveArrayCritical can only lock primitive array objects.

Java threads now never stall due to G1's JNI code.

With JDK-8314573, heap resizing during the Remark pause changed slightly to make resizing more consistent. The heap size change is now computed based on -XX:Min/MaxHeapFreeRatio without considering eden regions. Since the Remark pause can occur at any time during the mutator phase, the previous behavior made the heap size change highly dependent on the current eden occupancy, i.e. on how far through the mutator phase the application happened to be when the Remark pause occurred; the number of free regions used in the calculation could vary widely, resulting in different heap sizing.

This led to heap resizing
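The exponential back-skip encoding described for the BOT above can be sketched in a few lines of Java (a simplified model, not Hotspot's actual encoding; the entry format and names are invented for illustration): entries that do not describe an object start store a power-of-two number of cards to jump back, so walking from the middle of a large object to its start takes O(log n) jumps instead of n "look at the previous card" steps.

```java
public class BotBackSkip {
    // Hypothetical entry encoding:
    //   entry >= 0 : an object starts 'entry' words before this card's base.
    //   entry <  0 : jump back 2^(-entry - 1) cards and look again.
    static int stepsToStart(int[] bot, int card) {
        int steps = 0;
        while (bot[card] < 0) {
            card -= 1 << (-bot[card] - 1); // exponential back-skip
            steps++;
        }
        return steps;
    }

    // Build a BOT for a single object spanning 'cards' cards, starting at
    // card 0, using the largest power-of-two skip that stays inside it.
    static int[] exampleBot(int cards) {
        int[] bot = new int[cards];
        bot[0] = 0; // object start recorded at card 0
        for (int c = 1; c < cards; c++) {
            int k = 31 - Integer.numberOfLeadingZeros(c); // floor(log2(c))
            bot[c] = -(k + 1);
        }
        return bot;
    }

    public static void main(String[] args) {
        int[] bot = exampleBot(17);
        System.out.println(stepsToStart(bot, 16)); // 1 jump (16 cards at once)
        System.out.println(stepsToStart(bot, 13)); // 3 jumps (8, then 4, then 1)
    }
}
```

With the old "previous card" scheme, the same lookups would take 16 and 13 steps respectively; the gap widens with object size, which is exactly the effect JDK-8321013 targets.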

2024/2/6

What is a Concurrent Undo Cycle

Recently someone asked about the meaning of the Concurrent Undo Cycle message in the logs — this post explains this optimization and its impact on your application.

Introduction

The G1 garbage collector performs whole-heap liveness analysis (marking) concurrently with the application. This process starts when G1 detects that old-generation Java heap occupancy reaches the Initiating Heap Occupancy Percent (IHOP). The IHOP value is computed adaptively from the typical duration of concurrent marking and the old-generation allocation rate.

When the garbage collector observes that old-generation heap occupancy reaches this IHOP value, the next garbage collection is a Concurrent Start garbage collection pause, followed by concurrent marking. After that completes, G1 starts the mixed phase, collecting old-generation heap regions.

The problem

A concurrent marking cycle can take a long time and be CPU-intensive. For example, the log snippet below shows G1 taking about 4.2 seconds to complete a concurrent marking cycle. This includes traversing the old-generation object graph and preparing for subsequent old-generation evacuation (see this post for details).

[15,431s][info ][gc ] GC(12) Pause Young (Concurrent Start) (G1 Evacuation Pause) 11263M->10895M(20480M) 153,495ms
[15,431s][info ][gc ] GC(13) Concurrent Mark Cycle
[15,431s][info ][gc,marking] GC(13) Concurrent Scan Root Regions
[15,431s][debug ][gc,ergo ] GC(13) Running G1 Root Region Scan using 6 workers for 41 work units.
[15,507s][info ][gc,marking] GC(13) Concurrent Scan Root Regions 75,589ms
[15,507s][info ][gc,marking] GC(13) Concurrent Mark
[15,507s][info ][gc,marking] GC(13) Concurrent Mark From Roots
[18,731s][info ][gc,marking] GC(13) Concurrent Mark From Roots 3224,174ms
[18,731s][info ][gc,marking] GC(13) Concurrent Preclean
[18,731s][info ][gc,marking] GC(13) Concurrent Preclean 0,038ms
[18,734s][info ][gc ] GC(13) Pause Remark 11287M->11287M(20480M) 2,292ms
[18,734s][info ][gc,marking] GC(13) Concurrent Mark 3226,866ms
[18,734s][info ][gc,marking] GC(13) Concurrent Rebuild Remembered Sets and Scrub Regions
[19,687s][info ][gc,marking] GC(13) Concurrent Rebuild Remembered Sets and Scrub Regions 953,419ms
[19,688s][info ][gc ] GC(13) Pause Cleanup 11447M->11447M(20480M) 0,523ms
[19,688s][info ][gc,marking] GC(13) Concurrent Clear Claimed Marks
[19,688s][info ][gc,marking] GC(13) Concurrent Clear Claimed Marks 0,044ms
[19,688s][info ][gc,marking] GC(13) Concurrent Cleanup for Next Mark
[19,695s][info ][gc,marking] GC(13) Concurrent Cleanup for Next Mark 6,145ms
[19,695s][info ][gc ] GC(13) Concurrent Mark Cycle 4263,187ms

What if, after the Concurrent Start pause, the condition that triggered the concurrent marking cycle no longer holds (i.e. old-generation heap occupancy is below the initiating threshold)? G1 does not start a full concurrent marking cycle, because memory occupancy indicates there is no need to.

The problem is that some global changes were made during the Concurrent Start garbage collection pause. This is exactly what the Concurrent Undo Cycle is for.

Background

A Concurrent Start garbage collection pause is similar to a regular young garbage collection pause. In a regular young collection pause, G1 selects the regions to collect (the whole young generation plus some collection set candidate regions) and scans root locations for references into these regions. Referenced objects and their followers are evacuated. G1 also performs a limited, conservative liveness analysis of humongous objects to try to (eagerly) reclaim dead ones. After some cleanup, the application resumes.

A Concurrent Start pause differs from a regular young collection pause only slightly:

While scanning VM root locations, G1 marks locations that point into the old generation in the mark bitmap. For example, G1 marks references in thread stack variables that point to old-generation objects.

G1 determines root regions. These are areas of memory (typically whole regions) that may contain references into the old generation (references that are not traced for marking). Root regions are similar to the root locations above, but to save time their referenced objects are not marked in the mark bitmap during the garbage collection pause. The Concurrent Scan Root Regions phase of the subsequent marking cycle scans the live objects in root regions for such references into the old generation and marks the referenced objects in the mark bitmap. Examples of such regions are survivor regions, as well as collection set candidate regions that will be collected at the next garbage collection (potentially during marking). An example of the latter are regions that failed evacuation.

Evacuation of collection set candidate regions, and the eager reclaim mentioned above, mean that in some situations old-generation occupancy can drop significantly during the garbage collection pause.

The change

JDK-8240556 implements short-circuiting of the concurrent marking cycle. If, due to eager reclamation of humongous objects, old-generation heap occupancy dropped below the IHOP value during the Concurrent Start pause, G1 no longer runs the full marking cycle shown in the state diagram below, but takes the short-circuit path indicated by the blue arrow on the right.

As the log messages above show, the skipped states can save significant CPU usage (and time).

The remaining states form the Concurrent Undo Cycle shown below:

These states show up in the corresponding log output. Here is an example log snippet for a Concurrent Undo Cycle:

[6278.870s][info][gc ] GC(54) Concurrent Undo Cycle
[6278.870s][info][gc,marking] GC(54) Concurrent Clear Claimed Marks
[6278.886s][info][gc,marking] GC(54) Concurrent Clear Claimed Marks 0.016ms
[6278.886s][info][gc,marking] GC(54) Concurrent Cleanup for Next Mark
[6278.948s][info][gc,marking] GC(54) Concurrent Cleanup for Next Mark 77.814ms
[6278.948s][info][gc ] GC(54) Concurrent Undo Cycle 78.038ms

G1 only needs to run the Concurrent Clear Claimed Marks and Concurrent Cleanup for Next Mark phases: the former resets internal information about class loaders in use, the latter resets the mark bitmap used for marking root references in VM-internal data structures. Everything else is skipped.

Note that the Concurrent Cleanup for Next Mark phase can still take a fairly long time, but compared to a regular concurrent marking cycle it should complete much faster. JDK-8316355 records some potentially faster alternatives; if you are interested in working on this improvement, feel free to get in touch. :)

Summary

In G1, a Concurrent Undo Cycle is the result of an optimization that avoids the most CPU-intensive parts of a concurrent marking cycle, and nothing to worry about. If it occurs, it only means that G1 found that a complete whole-heap concurrent marking was unnecessary. Concurrent Undo Cycles were introduced in JDK 16, so if you are running a recent JDK you may already have encountered them.

Until next time, Thomas
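The short-circuit decision described above boils down to a single re-check after the Concurrent Start pause. Here is an illustrative Java sketch (not Hotspot code; names and the fixed threshold are invented — the real IHOP is computed adaptively):

```java
public class UndoCycleDecision {
    enum Cycle { FULL_MARK, UNDO }

    // After the Concurrent Start pause, re-check the condition that
    // triggered marking: if eager reclaim of dead humongous objects
    // dropped old-gen occupancy back below the IHOP threshold, only the
    // cheap undo phases (Clear Claimed Marks, Cleanup for Next Mark) run.
    static Cycle afterConcurrentStart(long oldGenBytes, long ihopBytes) {
        return oldGenBytes >= ihopBytes ? Cycle.FULL_MARK : Cycle.UNDO;
    }

    public static void main(String[] args) {
        long ihop = 9L << 30; // e.g. a 9 GB initiating threshold (made up)
        // Eager reclaim freed several GB of dead humongous objects:
        System.out.println(afterConcurrentStart(7L << 30, ihop));  // UNDO
        System.out.println(afterConcurrentStart(11L << 30, ihop)); // FULL_MARK
    }
}
```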

2023/9/28

JDK 21 G1/Parallel/Serial GC Changes

JDK 21 GA is on track — and so is this update on the changes to OpenJDK's stop-the-world garbage collectors in this release. ;) This release will be designated an LTS release by most if not all OpenJDK distributions. Please also check out the previous blog posts for JDK 18, JDK 19 and JDK 20 to see what improvements in the garbage collection area an upgrade from the previous LTS, JDK 17, brings.

Back to this release: before getting to the stop-the-world collectors, probably the most impactful garbage collection change in this release is the introduction of Generational ZGC. It improves the Z Garbage Collector by maintaining young- and old-generation objects, like the stop-the-world collectors do, concentrating garbage collection work on the areas that typically produce the most garbage. This substantially reduces the Java heap memory and CPU overhead needed to keep your application running. To enable Generational ZGC, pass the -XX:+ZGenerational flag in addition to -XX:+UseZGC.

Apart from that, the full list of changes for the entire Hotspot GC subcomponent in JDK 21 is here; at the time of writing roughly 250 changes have been resolved or closed, slightly more than in previous releases.

Here is an overview of the main JDK 21 changes to Hotspot's stop-the-world collectors:

Parallel GC

No significant changes for Parallel GC.

Serial GC

No significant changes for Serial GC.

G1 GC

These are the user-facing changes for G1 in JDK 21:

During full GC, G1 now moves humongous objects to decrease the possibility of OOME due to region-level fragmentation. While JDK 20 and earlier strictly never moved humongous (large) objects, this restriction has been lifted for last-ditch full GCs. A last-ditch full GC occurs after a regular full GC did not yield enough (contiguous) space to satisfy the current allocation (JDK-8191565). This improved last-ditch full GC can avoid some previously occurring out-of-memory situations.

Another full-GC-related improvement (JDK-8302215) makes the Java heap more compact after full GC, slightly reducing fragmentation.

The so-called Hot Card Cache was found to have (at least) no impact after the concurrent refinement changes in JDK 20, and has been removed. The intent of this data structure was to collect locations where the application very frequently modifies references. These locations were put into the Hot Card Cache to be processed in one batch at the next garbage collection instead of being refined over and over concurrently. The significance of its removal is that its data structures required a fairly large amount of native memory (0.2% of Java heap size) to operate. That native memory is now available for other purposes. The corresponding option -XX:+/-UseHotCardCache is now obsolete.

In applications with thousands of threads, a substantial part of pause time in previous releases could be spent tearing down and setting up the (per-thread) TLABs. With JDK-8302122 and JDK-8301116 these two phases have been parallelized, respectively.

Preventive garbage collections have been removed entirely with JDK-8297639, for the reasons stated earlier.

What's next

Work on JDK 22 has started, immediately picking up the features that did not make this release. Here is a short list of interesting changes already integrated or in development. As usual, there is no guarantee they will show up in the next release ;)

The change improving G1's resilience against evacuation-failed regions flooding the old generation has been in review for a while (JDK-8140326, PR) and is about to be pushed. It removes G1's previous restriction that only particular types of young collections could reclaim space in old-generation regions. After it, any young collection can evacuate old-generation regions as long as the usual requirements are met. For now this helps reclaim space quickly after evacuation failures, but among other interesting ideas, this change provides the necessary infrastructure for efficient and timely evacuation of pinned regions once they are unpinned.

Building on that change, region pinning is only one PR away.

A fix for the long-standing GCLocker problem of starving threads from progress and causing unnecessary out-of-memory exceptions is in development (JDK-8308507) and already under review.

More to come in the next months :)

Thanks go to…

Thanks to everyone who contributed to yet another great JDK release. See you in the next release :) Thomas

2023/8/4

JDK 20 G1/Parallel/Serial GC Changes

Another JDK release on schedule — JDK 20 GA is imminent. This is a good opportunity to summarize the improvements and changes to Hotspot's stop-the-world garbage collectors in the JDK 20 release.

There is no JEP in the garbage collection area in this release, but the JEP for Generational ZGC recently reached Candidate status, so it may show up in JDK 21 :) In addition, the full list of changes for the entire Hotspot GC subcomponent in JDK 20 is here, showing roughly 220 changes resolved or closed.

Parallel GC

The most notable Parallel GC improvement is the parallelization of the handling of objects crossing compaction regions during full GC (JDK-8292296). Instead of fixing up these objects in a final single-threaded phase, worker threads now collect the objects crossing their local compaction regions and process them on their own. Contributor M. Gasson reports a 20% reduction of full GC pause time in a particular scenario.

Serial GC

No significant changes for Serial GC, mostly a lot of code cleanup.

G1 GC

Overall, JDK 20 completed all the items on the "What's next" list of the previous report — and then some :)

JDK-8210708 reduces G1 native memory footprint by about 1.5% of Java heap size by removing one of the mark bitmaps spanning the entire Java heap. The blog post "Concurrent Marking in G1" explores this change in depth. That post also shows that the information about concurrent marking in the original G1 paper is outdated; little of the paper's description still accurately reflects current G1 behavior.

Actually, that blog post is itself already outdated in some respects: JDK-8295118 moves the sometimes lengthy "Clear Claimed Marks" part of concurrent marking preparation out of the Concurrent Start garbage collection pause. With gc+marking=debug logging enabled, a new phase called Concurrent Clear Claimed Marks shows up in the logs.

In preparation for future region pinning in G1, JDK-8256265 decreases the granularity of parallelization when handling regions pinned by the user (or regions that could not be evacuated due to lack of Java heap space). Task granularity changed from handing out whole regions per thread to handing out parts of regions. This balances work across threads better, significantly reducing unnecessary wait time.

JDK-8137022 makes refinement thread control more adaptive. Previously, during the garbage collection pause, G1 calculated discrete thresholds at which particular refinement threads were activated (or deactivated), based on the allowed time the user wants garbage collection pauses to spend refining cards (set via the -XX:G1RSetUpdatingPauseTimePercent option), the most recent interval until the next pause, and a lot of magic numbers.

Since it did not directly observe application behavior (such as the recent incoming workload and the refinement threads' processing rates, or the expected time until the next garbage collection), refinement thread control tended to wake up in bursts and do much more unnecessary work than needed, to avoid leaving too much work for the collection pause. This behavior not only wasted CPU cycles in the refinement threads; it had another drawback: too few unrefined cards left on the Java heap. That sounds like a good thing, but below a certain level it becomes counterproductive: whenever a new, previously unvisited card shows up, the write barrier has to do more work than if the card had simply been kept in the queue to be processed later. The extra refinement thread activity costs CPU, and G1 would refine the same cards over and over without meaningfully reducing the work remaining for the pause.

By taking mutator activity into account with this change, G1 can now spread refinement thread activity between pauses much better and keep more cards in the refinement queues for longer, reducing the rate at which new cards are generated and meeting the user's intent (i.e. -XX:G1RSetUpdatingPauseTimePercent) more deterministically. In the end this typically saves CPU cycles for the application, providing better throughput.

Several G1 garbage collector options related to the old refinement control are now obsolete; using them at startup will make the VM exit. The release notes describe them in detail.

G1 uses promotion-local allocation buffers (PLABs) to reduce synchronization overhead during garbage collection pauses. PLAB sizes adapt to recent allocation patterns to reduce the unused space left in these buffers at the end of the pause: they shrink when allocation demand is low, and grow otherwise. This per-collection adaptation works well for applications with fairly stable allocation between collections, but not for extremely bursty allocation. PLABs end up too large after a few collections with little allocation during GC, wasting space; or too small when allocation spikes, leading to heavy
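The per-collection PLAB resizing just described can be sketched as follows (an illustrative Java model, not Hotspot code; the weight, waste target, and bounds are invented parameters): recent per-GC allocation is tracked with an exponentially weighted moving average, and the next PLAB size is derived from it, clamped to fixed bounds.

```java
public class PlabSizer {
    static final int MIN_WORDS = 256, MAX_WORDS = 64 * 1024; // made-up bounds
    static final double WEIGHT = 0.3; // how quickly we adapt to recent GCs

    private double avgAllocWords = 4096; // initial guess before any GC

    // Called once per garbage collection with the words promoted last GC.
    int nextPlabWords(long allocatedLastGc) {
        avgAllocWords = (1 - WEIGHT) * avgAllocWords + WEIGHT * allocatedLastGc;
        // Size each buffer as a small slice of the average to bound waste.
        long target = Math.round(avgAllocWords / 100);
        return (int) Math.max(MIN_WORDS, Math.min(MAX_WORDS, target));
    }

    public static void main(String[] args) {
        PlabSizer sizer = new PlabSizer();
        // Quiet collections shrink the buffers toward the minimum...
        System.out.println(sizer.nextPlabWords(0));
        // ...so a sudden promotion burst hits undersized PLABs first; the
        // sizer only catches up (clamped at MAX_WORDS) on later collections.
        System.out.println(sizer.nextPlabWords(100_000_000L));
    }
}
```

The lag visible in the example is exactly the bursty-allocation weakness the post describes: the average trails behind abrupt changes in promotion volume.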

2023/3/14

The Case of Ljdk.vm.internal.FillerArray

While writing the JDK 19 update blog post, I seriously considered covering the "new" objects that may show up in heap dumps (jdk.vm.internal.FillerObject and jdk.vm.internal.FillerArray). At first they did not seem that interesting, but apparently people noticed them quickly. So this post finally explains what they are for.

Update 2024/07/22: Since JDK 21, the class name of the latter has been changed to [Ljdk/internal/vm/FillerElement;.

Background

In Hotspot, the Java heap has a rather interesting property we call parsability. This means that parts of the heap (depending on the garbage collector, these parts may be called regions, spaces, or generations) can be walked linearly, from their respective bottom to top addresses. The garbage collectors guarantee that after one object in the Java heap there always follows another valid Java object of some kind.

Every object header contains information about the object's type (i.e. class), which makes it possible to derive that object's size.

This property is exploited in several ways. A non-exhaustive list:

Heap statistics (jmap -histo): just start at the bottom of all parts of the heap, walk to the top, and gather statistics.

Heap dumps.

Some collectors require the heap to be at least partially parsable for partial heap collections of the young generation: they use imprecise remembered sets that record approximate locations where interesting references exist. These recorded areas (remembered set entries) need to be walked quickly during garbage collection. This post explains in more detail in its introduction how that works.

So how can the Java heap become non-parsable in the first place? After all, objects are allocated linearly. There are two (maybe more) causes:

The first is LAB (local allocation buffer) allocation: threads do not synchronize with other threads for every memory allocation; instead they allocate larger buffers used for local allocation without synchronizing with other threads. These LABs are of fixed size — if a thread cannot fit an allocation into its current LAB, the remaining area needs to be formatted properly before a new LAB is obtained.

The second is class unloading. Class unloading makes objects whose classes have been unloaded non-parsable, because their class information is discarded. Some collectors avoid this problem by compacting the heap after class unloading, which removes all these stale objects from the parsable Java heap areas.

Before JDK 19, the Hotspot VM used instances of java.lang.Object and int arrays ([I) to re-format ("fill") these holes in the heap. The former were only used to fill the smallest holes; int arrays were used for everything else.

That is, in full heap histograms or heap dumps containing non-live data, you may have noticed lots of java.lang.Object and [I instances that are not referenced from anywhere in your program.

Here is a jmap -histo <pid> example from running some program with an older JVM:

num #instances #bytes class name (module)
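The linear heap walk that parsability enables can be sketched in Java (a simplified model, not Hotspot internals; the classes and sizes are invented): every slot — including filler objects formatted into the unused tail of a retired LAB — knows its own size, so the walk never gets stuck.

```java
import java.util.ArrayList;
import java.util.List;

public class HeapWalk {
    // Simplified stand-in for a heap object: its header yields class + size.
    record HeapObject(String className, int sizeWords) {}

    // Walk a space linearly from word 0 to 'topWords', collecting class names
    // the way jmap -histo gathers its statistics.
    static List<String> walk(List<HeapObject> space, int topWords) {
        List<String> seen = new ArrayList<>();
        int addr = 0;
        for (HeapObject o : space) {
            if (addr >= topWords) break;
            seen.add(o.className());
            addr += o.sizeWords(); // next object starts right after this one
        }
        if (addr != topWords) throw new IllegalStateException("heap not parsable");
        return seen;
    }

    public static void main(String[] args) {
        List<HeapObject> space = List.of(
            new HeapObject("java.lang.String", 3),
            // Unused tail of a retired LAB, formatted as a filler array:
            new HeapObject("[Ljdk/internal/vm/FillerElement;", 13),
            new HeapObject("java.lang.Object", 2));
        System.out.println(walk(space, 18));
    }
}
```

Without the filler in the middle, the walk would land on unformatted memory and the parsability guarantee — and with it tools like jmap — would break.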

2022/9/26

JDK 19 G1/Parallel/Serial GC Changes

JDK 19 GA is imminent. Let me once again summarize the changes and improvements to Hotspot's stop-the-world garbage collectors in that release — particularly G1 and Parallel GC. :)

Overall, there is no JEP specific to the garbage collection area in this release. However, two large changes required support from the garbage collection area: first, JEP 425: Virtual Threads (Preview); second, the JDK 8 Maintenance Release 4.

Virtual threads introduce Java heap objects that contain thread stacks. All garbage collection algorithms need to handle this new kind of object correctly. The main complexity, however, is that unloading stale compiled code referenced by these objects requires fairly complex interaction between the garbage collectors and the code cache sweeper (the "GC for compiled code"), at least in JDK 19. Improvements are already underway, for example handing control of method unloading entirely to the garbage collectors, or fixing overly aggressive sweeping by Loom.

These issues do not occur when Loom is not enabled via the preview options.

JDK 8 shipped a maintenance release. The change detailed here removes previously undefined behavior in java.lang.ref.Reference processing that could lead to crashes. A side effect of that change is that reference processing in the garbage collectors can be improved somewhat.

Other than that, the full list of changes for the entire Hotspot GC subcomponent in JDK 19 is here, showing roughly 200 changes resolved or closed. The two items above left their mark :)

A brief look at ZGC shows only minor changes in this release: JDK-8255495 adds partial CDS support (class data is archived, but heap objects are not), plus some bug fixes. For some time now the focus has been on Generational ZGC — you can follow development here; stay tuned.

Parallel GC

Parallel GC now supports archived heap objects. This capability was added with JDK-8274788.

JDK-8280705 improves work distribution in the first phase of full collections, sometimes yielding substantial performance improvements.

Serial GC

No notable changes for Serial GC.

G1 GC

No notable improvements other than JDK-8280396, which implements the same optimization as the Parallel GC full-collection change mentioned above for G1 full collections.

Late in the JDK 19 release process a native memory usage regression surfaced, too late for the fix to be integrated. The release notes provide a workaround. The regression has been resolved in JDK 19.0.1 and later.

What's next

Work on JDK 20 has long begun. Here is a short list of interesting changes currently in development. Many of them are actually already integrated — Virtual Threads and the JDK 8 MR4 took a fair amount of resources, so some changes did not make the JDK 19 feature freeze. Apart from those, there is no guarantee they will make JDK 20. They will be integrated when they are done ;-)

With the integration of JDK-8210708 into JDK 20, G1 native memory usage decreased substantially by removing one of the mark bitmaps spanning the entire Java heap. The Concurrent Marking in G1 blog post discusses this change and its impact in detail. This change finally achieves the native memory footprint goal of the "Prototype (calculated)" line originally planned for JDK 19.

Another significant change improving parallelism (and performance) of evacuation failure handling has been integrated into JDK 20; it is an important step towards the future JEP 423: Region Pinning for G1.

Concurrent refinement control is being revised as part of JDK-8137022. In general, the rate at which refinement cards are processed concurrently is now much smoother. A side effect is that G1 keeps more dirty cards in the queues for longer, decreasing the rate at which new dirty cards are generated and ultimately reducing overall concurrent refinement work. This can improve application throughput.

Applications with bursts of promoted objects may observe pause-time spikes of multiple seconds (instead of 1-200 ms). Large Aarch64 machines in particular are prone to this. The cause is untrained (or badly trained) promotion-local allocation buffer (PLAB) sizes. JDK-8288966 fixes this with moderately aggressive PLAB resizing during garbage collection.

Work is underway on improving G1's predictions to better observe the pause-time goal (-XX:MaxGCPauseMillis). The goal is to reduce both overshoots (reducing latency spikes) and undershoots, performing fewer garbage collections while keeping the pause-time goal. In some applications this can significantly decrease total garbage collection time, freeing up more time for the application. Parts already in effect include e.g. JDK-8231731.

Recently there have been renewed calls for some long-standing improvements to make the Hotspot garbage collectors better support runtime reconfiguration (e.g. JDK-8236073, JDK-8204088 for G1) and be nicer to other tenants (e.g. JDK-8238687), improving container friendliness.

More to come :)

Thanks go to…

Thanks to everyone who contributed to a great JDK release. See you in the next release :) Thomas

2022/9/16