[SOLVED] Swap: Treated like cache hierarchy or filled consecutively?

tonnz · 2018-11-17 09:01:30

When setting up multiple swap devices of different speeds, and setting up the priorities accordingly, will the kernel simply fill one swap device after the other, or will more complex algorithms be used to move less frequently used pages to the slower device?

The wiki is not entirely clear on this:

The system will use swap areas of higher priority before using swap areas of lower priority.

It is generally known that infrequently used pages are moved from memory to swap first, but i could not find anything anywhere that describes if or how this works between swap devices.
I'm guessing this question is best asked in some kernel mailing list, just asking here first in case somebody knows (so i don't have to bother any kernel dev).
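
For anyone wanting to check which priorities their own swap areas currently have, here is a minimal Python sketch that parses text in the format of /proc/swaps and lists the areas from highest to lowest priority. The sample text and device names are made up for illustration, and the sketch only shows the order the wiki sentence describes; it says nothing about what the kernel does beyond that.

```python
from typing import List, NamedTuple


class SwapArea(NamedTuple):
    name: str
    kind: str
    size_kib: int
    used_kib: int
    priority: int


def parse_swaps(text: str) -> List[SwapArea]:
    """Parse /proc/swaps-style text; return areas sorted by priority, highest first."""
    areas = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 5:
            continue  # ignore blank or unexpected lines
        name, kind, size, used, prio = fields
        areas.append(SwapArea(name, kind, int(size), int(used), int(prio)))
    # sorted() is stable, so areas with equal priority keep their listed order
    return sorted(areas, key=lambda a: a.priority, reverse=True)


# Hypothetical sample: a zram device preferred over a disk partition.
SAMPLE = """\
Filename                                Type            Size            Used            Priority
/dev/sda2                               partition       2097148         0               -2
/dev/zram0                              partition       1048572         0               100
"""

for area in parse_swaps(SAMPLE):
    print(f"{area.priority:>6}  {area.name}  {area.used_kib}/{area.size_kib} KiB used")
```

On a Linux machine the same function can be fed the real file, e.g. parse_swaps(open("/proc/swaps").read()).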

Thanks

EDIT: Just realizing that if the kernel simply filled one device after the other, all the infrequently used pages would end up in the highest-priority swap device, while the next more frequently used pages would end up in the lower-priority device. This would be really bad.

Last edited by tonnz (2018-11-17 13:18:10)

mpan · 2018-11-17 11:24:17

Pages in swap are not more or less frequently used. They are approximately unused. If pages that are actually in use end up in swap, it means that your machine does not have enough RAM and you should simply buy more RAM.

Because even an unused page may need to be pulled back into RAM at some point, it is desirable to fill the fastest device first. This way swap gets filled from the fastest medium to the slowest one, so a page that has to be read back is most likely to sit on the fastest device.

However, I consider this to have been important only in the distant past. Back when prehistoric dinosaurs roamed the Earth, swap-in and swap-out were frequent, the understanding of what swap is was low and, even worse, some people were trying to “cheat” and improperly treated swap as “free RAM”. Nowadays writes and reads to/from swap on a healthy system are so infrequent that the difference should not be noticeable. If the system reads from swap less than 1/10000 of the time, it doesn’t really matter if a medium is 5% faster or slower. Therefore swap priorities seem to be a remnant of old times.
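
The 1/10000 and 5% figures above can be multiplied out. This is only a back-of-the-envelope sketch in Python, assuming that swap reads take 1/10000 of total run time and that a slower medium makes just that part 5% longer; the function name is made up for illustration.

```python
def overall_slowdown(swap_time_fraction: float, device_slowdown: float) -> float:
    """Relative increase in total run time when only the swap part gets slower."""
    return swap_time_fraction * device_slowdown


# 1/10000 of the time spent on swap, medium 5% slower
print(f"{overall_slowdown(1 / 10000, 0.05):.4%}")
```

Under these assumptions the whole system gets slower by roughly 0.0005%, far below anything one could measure.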

--- edit

Let’s analyze a simple case. 6GiB RAM, three media used for swap. The media, from fastest to slowest, are: swap-dog, swap-cat and swap-bug. Let’s say 4GiB is currently used and system decides to move pages to the swap. Imagine three distinct cases, which require 1GiB, 2GiB and 3GiB to be moved to swap. At this point you should already notice that moving 2GiB to swap involves moving 1GiB to swap, and the 3GiB case includes the other two.

Here is a table that shows, what is the situation at the end of each scenario:

╔═══════════╦═════════════════════════════════╤══════════════════════════╗
║ memory    ║             R A M               │         s w a p          ║
║           ║                                 │  -dog  ╷  -cat  ╷  -bug  ║
╟───────────╫─────────────────────────────────┼────────┼────────┼────────╢
║ size[GiB] ║                                 │        │        │        ║
╟────────   ║ 0    1    2    3    4    5    6 │ 0    1 │ 0    1 │ 0    1 ║
║ case    ╲ ║ ┆    ┆    ┆    ┆    ┆    ┆    ┆ │ ┆    ┆ │ ┆    ┆ │ ┆    ┆ ║
╠═══════════╬═════════════════════════════════╪════════╪════════╪════════╣
║ initial   ║ █████████████████████░░░░░░░░░  │ ░░░░░  │ ░░░░░  │ ░░░░░  ║
╟───────────╫─────────────────────────────────┼────────┼────────┼────────╢
║ #1 (1GiB) ║ ███████████████░░░░░░░░░░░░░░░  │ █████  │ ░░░░░  │ ░░░░░  ║
╟───────────╫─────────────────────────────────┼────────┼────────┼────────╢
║ #2 (2GiB) ║ ██████████░░░░░░░░░░░░░░░░░░░░  │ █████  │ █████  │ ░░░░░  ║
╟───────────╫─────────────────────────────────┼────────┼────────┼────────╢
║ #3 (3GiB) ║ █████░░░░░░░░░░░░░░░░░░░░░░░░░  │ █████  │ █████  │ █████  ║
╚═══════════╩═════════════════════════════════╧════════╧════════╧════════╝

Observe that swap-dog is used in all three cases. This means that, on average, pages will be written to and read from swap-dog, the fastest device. Then, if still more data has to be swapped out, swap-cat is also used. It is used in two of the three cases, which makes it less likely to be used than swap-dog, but the medium-speed device still ends up used more often than swap-bug (the slowest one).

The way I have drawn this table makes the probability distribution of devices being used easy to see. The cases are, however, not to scale. In reality case #1 is much more probable than the other two, and case #2 is still more likely than the extreme case #3. Which means that, in reality, the distribution is more like this:

#1 ┄ █
     █
     █
     █
     █
     █
     █
     █
     █
     █
     █
     █
     █
     █
     █
     █
#2 ┄ █  █
#3 ┄ █  █  ▁

     d  c  b
     o  a  u
     g  t  g

What this means is that, in the long run, unused pages will be on the fastest medium in most cases.
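
The table and the bar chart above can be reproduced with a few lines of Python. This is a sketch under the same simplification as the table: three 1GiB devices filled strictly in priority order. The weights 16/1/1 are just read off the bar chart (16 rows for case #1 alone) and are an interpretation, not measured data.

```python
from typing import Dict, List, Optional, Tuple

# Fastest device first; the 1 GiB sizes follow the table above.
DEVICES = [("swap-dog", 1.0), ("swap-cat", 1.0), ("swap-bug", 1.0)]
CASES = [1.0, 2.0, 3.0]  # GiB to swap out in cases #1, #2, #3


def fill_by_priority(amount_gib: float,
                     devices: List[Tuple[str, float]]) -> Dict[str, float]:
    """Spread amount_gib over the devices in order, filling each before the next."""
    placed = {}
    remaining = amount_gib
    for name, capacity_gib in devices:
        used = min(remaining, capacity_gib)
        placed[name] = used
        remaining -= used
    return placed


def usage_counts(cases: List[float],
                 devices: List[Tuple[str, float]],
                 weights: Optional[List[int]] = None) -> Dict[str, int]:
    """Count, per device, the (weighted) number of cases in which it holds any data."""
    weights = weights or [1] * len(cases)
    counts = {name: 0 for name, _ in devices}
    for amount, weight in zip(cases, weights):
        for name, used in fill_by_priority(amount, devices).items():
            if used > 0:
                counts[name] += weight
    return counts


print(usage_counts(CASES, DEVICES))              # every case counted once
print(usage_counts(CASES, DEVICES, [16, 1, 1]))  # case #1 weighted as in the chart
```

With equal weights the counts come out as 3, 2 and 1 for dog, cat and bug, matching the table; with the chart-like weights they are 18, 2 and 1.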

I guess that what misled you was an assumption that pages are somehow distinct from each other and that you should track their history separately. Indeed, you may encounter some pages that take a very inefficient path. But the existence (∃) of a particular case doesn’t tell you what’s happening in all cases (∀). On a large scale, in the long run, the situation should be approached from a statistical point of view. And this is what I’ve painted above.

____
¹ In reality the size of swapped-out fragments will not be that nice. But I want to paint some picture. The statement still holds for any other amounts of memory of course, but to draw it precisely I would have to draw all 1722679482537250971241828775957480079360 cases… and this would not be very useful. ;)

Last edited by mpan (2018-11-17 12:55:14)

tonnz · 2018-11-17 13:16:44

Thanks for your reply, though it does not answer my question sadly. Just read your edit. Holy crap, thank you! Your graph shows what is essentially a best-case scenario for my case. Marking the thread as solved!

I guess that what misled you was an assumption that pages are somehow distinct from each other and that you should track their history separately.

Yes, this was my intuition at the beginning. Now that i think of it, even if there isn't any complex handling of swap priorities, the highest priority device would on longer timescales contain the more frequently accessed pages, simply because they are swapped out/in more frequently and so get more chances to "catch a spot" on the fastest device. If i understand correctly.

To reply to what you wrote before your edit:
I am working with zram as swap, testing on an old machine with 1 gb of ram. The compression algorithm in use is lz4, which has a throughput of around 500mb/s here. The raw memory bandwidth is around 1gb/s. Compression ratios are around 0.3-0.5 (uncompressible pages are written to a backing device).
Testing it with an in-memory custom kernel compilation (fits in ram, compared against zram+artificial memory use), there is barely a measurable compile time increase, because all of the swapped pages contain source code which is accessed relatively infrequently.
This comes temptingly close to "free ram".
My idea was using another zram swap device with zstd as another "level of cache" on top. Now i just need to find out how precisely multiple swaps will be handled by the kernel (Thanks again for your effort and definitive answer!)
Regarding swap priorities being a thing of the past: With multistream zram and nvme devices, i think it could make sense to have a similar setup for very memory-intensive workloads. Especially considering RAM prices. I.e. using zram as a step between RAM and flash storage, both speeding things up and relieving the flash storage from most of the writes.
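
To put rough numbers on the zram figures above (1 gb of RAM, compression ratios of 0.3-0.5), here is a small Python sketch. It assumes a hypothetical split where 512 MB of RAM end up holding compressed pages, and it ignores zram's own bookkeeping overhead and incompressible pages, so treat it as an upper-bound estimate only.

```python
def effective_capacity_mb(ram_mb: float, zram_store_mb: float, ratio: float) -> float:
    """Total data (MB) that fits when zram_store_mb of RAM holds compressed pages.

    ratio is compressed size / original size (0.3 means 30% of the original).
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    if not 0 <= zram_store_mb <= ram_mb:
        raise ValueError("zram_store_mb must be between 0 and ram_mb")
    uncompressed_part = ram_mb - zram_store_mb  # RAM left for ordinary pages
    compressed_part = zram_store_mb / ratio     # original size of the compressed pages
    return uncompressed_part + compressed_part


# Hypothetical split: half of a 1024 MB machine used for compressed pages.
for ratio in (0.3, 0.5):
    print(f"ratio {ratio}: about {effective_capacity_mb(1024, 512, ratio):.0f} MB")
```

With a ratio of 0.5 the 512 MB of compressed storage hold 1024 MB of pages, for 1536 MB in total; a ratio of 0.3 gives correspondingly more.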

Last edited by tonnz (2018-11-17 13:17:03)

mpan · 2018-11-17 13:43:31

Regarding the last paragraph of your answer: keep in mind that swap is not a substitute for RAM. This is not how swap should be used and it will not work on modern setups¹. Swap is not “free RAM”. No, if you do not have enough RAM, there is no way to jump over that obstacle by “cheating” with a slow device that provides more memory. You just need to buy more RAM: sorry, the reality is not always pleasant. Swap is used for completely different stuff. This may also be the source of confusion, because you are getting much more swap-in/out than you actually should.

I’m a bit surprised you are getting 500MB/s (assuming proper case), especially on an older computer. Encryption alone should get lower throughput. But maybe I’m overestimating what you mean by “old machine”. I’m working on a 10-year old machine myself, so I have a different perspective.

Note that, while you might have fallen into the wrong idea of not seeing this stuff from a statistical point of view, don’t fall into the opposite extreme. It only tells you what the “typical” case is, and “typical” is not equivalent to “100%”. Even if that is 99.9%, you may actually be in that 0.1%: so understand the problem and think about it.
____
¹ Unless the software is written specifically to work well with slow media. But this is unseen in programs an average person encounters, and rare even in specialized software.

tonnz · 2018-11-17 14:45:28

mpan wrote:

Regarding the last paragraph of your answer: keep in mind that swap is not a substitute for RAM. This is not how swap should be used and it will not work on modern setups¹. Swap is not “free RAM”. No, if you do not have enough RAM, there is no way to jump over that obstacle by “cheating” with a slow device that provides more memory. You just need to buy more RAM: sorry, the reality is not always pleasant. Swap is used for completely different stuff. This may also be the source of confusion, because you are getting much more swap-in/out than you actually should.

That makes me wonder; besides being used as filecaches, how are nvme "accelerators" used in enterprise applications? I'm assuming custom software that accesses these devices directly? I remember a presentation (slide 27, "SPECjbb Performance") from IBM that benchmarked the performance of zswap (not zram) for Java applications, using hardware-accelerated compression (842) and lz4. It shows a very clear performance advantage once memory got full. Though i know zswap isn't too comparable to zram-swap, besides both compressing ram. I'm guessing the pages zswap compresses circumvent the classic swap logic?

I’m a bit surprised you are getting 500MB/s (assuming proper case), especially on an older computer.

Me too. I used lzbench to benchmark the performance (tested on the first 512mb of the linux 4.18 source tarball). Now testing it again with just the lz4 binary yields about 100-150mb/s, which seems much more realistic. The machine is an IBM Thinkcentre 8183 Pentium 4 (2.8ghz, no SMT, Northwood).
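
For a quick sanity check of throughput and compression ratio without extra tools, something like the following Python sketch can be used. Note that it uses zlib from the standard library as a stand-in, not lz4, so the absolute numbers are not comparable with the lz4 figures in this thread; the test data is synthetic and made up for illustration.

```python
import os
import time
import zlib
from typing import Tuple


def measure(data: bytes, level: int = 1) -> Tuple[float, float]:
    """Return (throughput in MB/s, compressed/original size ratio) for one zlib pass."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    throughput = len(data) / elapsed / 1e6 if elapsed > 0 else float("inf")
    ratio = len(compressed) / len(data)
    return throughput, ratio


# Synthetic inputs: highly repetitive "source-like" text and incompressible noise.
text_like = b"int main(void) { return 0; }\n" * 200_000
random_like = os.urandom(len(text_like))

for label, data in (("text-like", text_like), ("random", random_like)):
    mb_per_s, ratio = measure(data)
    print(f"{label:>9}: {mb_per_s:8.1f} MB/s, ratio {ratio:.3f}")
```

A single pass like this is noisy; repeating it a few times and on real data (such as part of a source tarball) gives more meaningful numbers.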

Note that, while you might have fallen into the wrong idea of not seeing this stuff from a statistical point of view, don’t fall into the opposite extreme. It only tells you what the “typical” case is, and “typical” is not equivalent to “100%”. Even if that is 99.9%, you may actually be in that 0.1%: so understand the problem and think about it.

There's fortunately no real risk in my case. I intend to use this mostly as a way to compress tmpfs once it doesn't fit in RAM anymore; using zram for /tmp directly led to a number of problems, and is unnecessary when the RAM isn't used elsewhere. So i don't worry about it too much. One small zram-swap with lz4 isn't always enough (compiling Pale Moon from memory), and one big zram-swap with zstd is slow. I'm trying to find a compromise. Though all cases are much faster than compiling from the hard disk.

mpan · 2018-11-17 17:57:41

The presentation deals with a broken, badly configured system in which software has not enough RAM to run. As long as it uses less than all of the RAM, it performs very well, which is quite surprising, because I would expect it to start having problems earlier¹. Most of the chart deals with a condition in which the system runs with more than 100% of RAM in use. In other words: yeah, it deals with some problem, but the problem should never occur in a healthy system in the first place.

Why they needed to consider such a situation — I don’t know. Again, I suspect the situation I was mentioning already: someone is trying to “cheat” by not buying RAM they actually should buy. But this is not the use of swap. This is such a common misconception, that I already assume it without even thinking about different options :). Swap is there to let system free RAM from unused pages, so RAM could be used for useful stuff (cache). It is not to avoid having to buy RAM.
____
¹ But it may be caused by the removal of datapoints below 10GiB.

Last edited by mpan (2018-11-17 17:59:09)

tonnz · 2018-11-17 19:25:24

mpan wrote:

The presentation deals with a broken, badly configured system in which software has not enough RAM to run.

I don't see what you base that on considering no system details beyond CPU + memory size and compression are presented. And for the amount of RAM, the entire point of the presentation is to show what happens when RAM limits are reached, and what can be done when it happens.

mpan wrote:

because I would expect it to start having problems earlier

The JVM Heap Size does not necessarily reflect the actual amount of memory being used. You can see in the graph that the performance starts dropping around 11GB instead of 10, which indicates that the benchmark does not utilize the entire heap, which in a JVM could trash performance (exploding GC overhead).

mpan wrote:

but the problem should never occur in a healthy system in the first place.

I don't know if i read it in that presentation (has been a while) or somewhere else, but IBM advertises 842+zswap as a way to deal with unexpected overloads. That the system isn't healthy is a given.

mpan wrote:

Why they needed to consider such a situation — I don’t know.

See above

mpan wrote:

But this is not the use of swap. [...] Swap is there to let system free RAM from unused pages, so RAM could be used for useful stuff (cache). It is not to avoid having to buy RAM.

That's certainly the original, or rather historic, purpose of swap. Whether or not this still holds true today, i'd need a citation on that. And even if it does, here's an alternative use that has been proven to be useful. And since that presentation doesn't convince you: Currently compiling the entire LLVM/clang toolchain with all bells and whistles in-tree, total 'virtual' memory use is at 2849mb (on a 1gb machine), CPU utilization is at 100%, of which less than 3% is spent in the kernel (compressing/decompressing the memory pages would be done in the kernel). EDIT: And not a single byte of those 2849mb is on the hard disk.

Last edited by tonnz (2018-11-17 19:42:51)

mpan · 2018-11-17 21:28:28

tonnz wrote:

I don't see what you base that on considering no system details beyond CPU + memory size and compression are presented.

It is written in the very slide you mentioned.

tonnz wrote:

And for the amount of RAM, the entire point of the presentation is to show what happens when RAM limits are reached (…)

Which supports what I have written. There is not enough RAM. Swap is not “free RAM” to deal with the lack of RAM in the system.

tonnz wrote:

The JVM Heap Size does not necessarily reflect the actual amount of memory being used.

Obviously, since performance drops at the point at which the size of heap equals the size of RAM, the charts deal with heap that is actually used, not just set in the JVM settings.

tonnz wrote:

I don't know if i read it in that presentation (has been a while) or somewhere else, but IBM advertises 842+zswap as a way to deal with unexpected overloads. That the system isn't healthy is a given.

OOM killer deals with it swiftly, unless you have swap.

tonnz wrote:

That's certainly the original, or rather historic, purpose of swap. Whether or not this still holds true today, i'd need a citation on that

I am not really sure what kind of citation I could provide. That would require some kind of authority you trust, and I have no idea who that could be. To prove that it is not “just my opinion”, here you have a description from Chris Down. Since nowadays people think that swap should be completely abandoned, he argues that it still has an important role and explains what swap is useful for.

tonnz wrote:

And even if it does, here's an alternative use that has been proven to be useful. And since that presentation doesn't convince you: Currently compiling the entire LLVM/clang toolchain with all bells and whistles in-tree, total 'virtual' memory use is at 2849mb (on a 1gb machine), CPU utilization is at 100%, of which less than 3% is spent in the kernel (compressing/decompressing the memory pages would be done in the kernel). EDIT: And not a single byte of those 2849mb is on the hard disk.

That’s a clear example of trying to (sorry for repeating this for N-th time in this thread, but it seems the message doesn’t get through) acquire “free RAM”, because the machine that is used has not enough of it.

tonnz · 2018-11-18 05:32:43

TL;DR
You haven't explained what is wrong with "trying to cheat" or "trying to acquire free RAM". Compressing pages helps with the 0.01% of the time when your server is unexpectedly overloaded, without killing any process. What is wrong with that? Why would i want to buy more RAM only to account for these extremely rare cases, which compressing RAM gets me through with almost no slowdown and at zero cost? This is what i don't understand. Why shouldn't i use swap in a "nonstandard" way? Is IBM making a big mistake by recommending it, are they endangering their customers?

mpan wrote:

It is written in the very slide you mentioned.

Where? "10gb RAM, 2 core SMT4, Power7+, max_pool_percent=40"

OOM killer deals with it swiftly, unless you have swap.

How you can argue that killing processes is better than compressing infrequently used pages is beyond me. Sorry. In servers running critical applications, every process is started for a reason. Killing them is not an option. But thrashing is also not an option. Here is a way that does neither, but apparently it conflicts with some people's tastes.

To prove that it is not “just my opinion”, here you have a description from Chris Down.

Thanks. "With swap: We’re more resilient to temporary spikes, but in cases of severe memory starvation, the period from memory thrashing beginning to the OOM killer may be prolonged." The latter is not problematic if your swap device is very fast, and if the pages are accessed infrequently. We are not talking about swap on disks here, which this article is doing. We are talking about compressed memory and very fast flash storage. There is no thrashing, killing processes will never be necessary. Instead, the system remains responsive and performant, giving an administrator the opportunity to make the process-killing decision himself. Or the opportunity to simply wait, which also becomes an option with very fast swap.

That’s a clear example of trying to (sorry for repeating this for N-th time in this thread, but it seems the message doesn’t get through) acquire “free RAM”, because the machine that is used has not enough of it.

But again what is wrong with it? You have not explained it so far. Of course this isn't 'real' RAM. So what? Will my system explode? Will i get a million viruses? Will Torvalds get angry? I don't get it. Also for the Nth time, it works. It serves its purpose well. Should i stop the compilation because i have broken some kind of rules here? I showed you an easily reproducible example of using 300% memory at near 100% usermode CPU utilization. Tell me, from a technical perspective, why is this bad? Should i stop?

Sorry for the tone, but this is getting frustrating. Here are multiple ways in which compressing RAM can be useful, but the main response i get is "this is an attempt to get free RAM. This is cheating" with no technical explanation for how that is wrong in these specific cases.

Last edited by tonnz (2018-11-18 07:02:50)

mpan · 2018-11-18 19:04:26

Sorry, I’m unable to provide a technical explanation of why using a screwdriver to drive nails is not the right way to use that tool. /EOT

tonnz · 2018-11-19 13:45:22

mpan wrote:

Sorry, I’m unable to provide a technical explanation

Because there isn't any, since unlike your joke

mpan wrote:

of why using a screwdriver to drive nails is not the right way to use that tool.

this works just like the developers of zram and zswap intended. I suggest you direct your criticism at them if there's something you know and they don't. Likewise, /EOT

P.S: I'm still very thankful for the help you provided me at the beginning of this thread
