Partial Images Feature is slower than not using it #1928
Thanks for trying this out! It's a bit unfortunate no one else had tried to measure this yet, and we lack CI coverage for a lot of this. Just glancing at the code offhand, I'm pretty sure there's a lot more serialization happening in the main goroutine; e.g., it looks like the entire tar stream is copied to an O_TMPFILE and then checksummed? And there's a lot more processing going on than just "untar", so I think we may be losing out on the natural parallelization of having a thread/goroutine reading from the network, another handling decompression, and another executing writes. This one may be better filed against containers/storage where things can be more focused; we can transfer the issue if e.g. @giuseppe agrees.
Indeed, the CPU is not doing much; it is doing more when the feature is off. Also worth considering the cost of additional requests: e.g., if I have chunks A, B, C and I want A and C, and B is smaller than, say, 1MB on a fast connection, it might be worth bridging the range and grabbing B too. I'm not sure of the perf characteristics of multirange requests.
@giuseppe PTAL
That code path is taken only when we are converting a "non partial" image to zstd:chunked. This feature is disabled by default, because it is really slow, but we have added it to allow composefs also with images that are not in the zstd:chunked format. In theory the conversion could be optimized and each file checksummed on the fly, but it would be a completely different implementation, so the easiest way to enable composefs with traditional images was just to fetch the entire tarball, convert it to zstd:chunked, and then work on the converted file so all the existing code works with it. The only benefit is that we have fs-verity at runtime for these images, but the conversion cost is quite high at the moment.
Merging of adjacent chunks is already implemented here: `storage/pkg/chunked/storage_linux.go`, line 1208 (commit 5cd00c5).
Indeed. Perhaps then it is not aggressive enough. If I understand the code correctly, you keep merging chunks based on the thresholds defined in `storage/pkg/chunked/storage_linux.go`, lines 44 to 45 (commit 5cd00c5).
Provided either holds, you will keep merging. Both of these seem very lenient. Maybe it is worth defining a new parameter: `maxNumberMissingChunks = min(1024, image_size / transferOverheadMb)` together with `autoMergePartsThreshold = transferOverheadMb`. Therefore, if a layer is, say, 50MB, it will at most become 5 chunks, and more likely 1-2, so it would be downloaded immediately. Perhaps you do this already, though.
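The gap-bridging merge being discussed could be sketched like this. The function and the threshold value are illustrative only, not the actual containers/storage implementation: adjacent missing chunks are coalesced whenever the gap of unneeded bytes between them is cheaper to download than issuing another range part.

```go
package main

import "fmt"

// chunk is a byte range [start, start+length) within a layer blob.
type chunk struct {
	start, length int64
}

// planParts merges adjacent missing chunks whenever the gap between them is
// at most gapThreshold bytes, so a layer never fans out into an excessive
// number of HTTP range parts. Input chunks must be sorted by start offset.
func planParts(chunks []chunk, gapThreshold int64) []chunk {
	if len(chunks) == 0 {
		return nil
	}
	parts := []chunk{chunks[0]}
	for _, c := range chunks[1:] {
		last := &parts[len(parts)-1]
		gap := c.start - (last.start + last.length)
		if gap <= gapThreshold {
			// Bridge the gap: fetch the unneeded bytes in between too.
			last.length = c.start + c.length - last.start
		} else {
			parts = append(parts, c)
		}
	}
	return parts
}

func main() {
	const mb = int64(1) << 20
	chunks := []chunk{{0, 10 * mb}, {12 * mb, 5 * mb}, {40 * mb, 8 * mb}}
	// With a 5MB gap threshold the first two chunks merge (2MB gap);
	// the third stays separate (23MB gap).
	parts := planParts(chunks, 5*mb)
	fmt.Println(len(parts))
}
```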
Also, the progress slider of partial pulls has an update rate of 0.5 Hz, whereas non-partial pulls are around 5 Hz. It makes a substantial difference in the DX/UX of partial pulls, but it should be filed under a new issue.
We keep merging until the server accepts the request, since when we request too many chunks the header can be too big. I liked the idea @owtaylor had, which is not implemented yet: instead of trying to keep merging chunks, we simply do more HTTP range requests.
@antheas Great conversation, let's pick out the best
Seems so. I ran some tests under Wireshark and the GitHub servers do not seem very happy: 1M packets with a lot of errors vs 300k clean during an image pull.
To answer my own question, the progress slider freezing/unfreezing coincides with Wireshark logging failed requests.
I do not see how this would improve things. Splitting a large layer (e.g., 1-2GB) into multiple streams may be a way to speed up downloads as an end-game optimization, but as podman already pulls layers in parallel, I do not know if it would help in most cases. Also, you could just download the layers ordered by size, to minimize large last layers being stuck downloading by themselves like they are right now. As for prodding the server to find the max layer size, I think following the spec of partial requests might be better and more respectful to registries. Note that AWS charges for failed requests as well. Under it, you are supposed to send a range request and, depending on the response:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests#partial_request_responses

Therefore, a solid way to implement the pulls would be to formulate a range plan that is sane (maybe at most 100 chunks; example: #1928 (comment)), send it to the registry without checking first, and handle each response type per the spec.

Note that I do not know the details of multirange requests. If there is no overhead per chunk, it might make sense to use a large number of chunks, but it seems there is. If there is a small overhead, 5MB might be too large of a gap and maybe 1MB would be better; I doubt 128 bytes is good though. Under Wireshark, packet sizes for partial pulls were on average 6kb vs 30kb for pulls without them.

If you have another registry with a ~5GB chunked image (e.g., quay) I can also give it a go.
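The per-status handling that the MDN page describes could be sketched as follows. This is illustrative only; `rangeAction` is a hypothetical helper, not a containers/storage function, and the status semantics are the standard HTTP ones (206 Partial Content, 200 OK, 416 Range Not Satisfiable).

```go
package main

import "fmt"

// rangeAction says how a client should proceed after issuing a (multi)range
// request, following standard HTTP range-request semantics.
func rangeAction(status int) string {
	switch status {
	case 206: // Partial Content: the server honored the requested ranges.
		return "consume the multipart/byteranges (or single range) body"
	case 200: // OK: the server ignored Range and is sending the whole blob.
		return "fall back to a full pull and extract everything"
	case 416: // Range Not Satisfiable: the range plan was rejected.
		return "retry with fewer, larger ranges or do a full pull"
	default:
		return "treat as an error and retry or abort"
	}
}

func main() {
	for _, s := range []int{206, 200, 416} {
		fmt.Printf("%d -> %s\n", s, rangeAction(s))
	}
}
```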
Depends how you look at it. @owtaylor was looking at reducing the network traffic, so splitting the chunks into multiple requests would help avoid having to merge chunks too aggressively. So I think we need to balance between the two needs; should this be configurable?
We already handle the 200 response code if the registry returns that |
So I would guess, then, that the registry returns a different error if your chunk ranges are too large.
Indeed. There are 3 conflicting goals:
I am not fronting the Quay hosting budget, so I focused on number 3 here. I also focused on number 3 because we were waiting on

It is understandable and a good justification to have the download take a bit longer if it means the registry has bandwidth savings. However, I think 2 and 3 need a bit of attention here.
Yeah, it should be configurable, but how will take a bit of experimentation. Perhaps there's a turnkey parameter that works great, though, so it won't be needed.
Increase the threshold for auto-merging parts from 128 to 1024. This change aims to reduce the number of parts in an HTTP multi-range request, thus increasing the likelihood that the server will accept the request. The previous threshold of 128 often resulted in a large number of small ranges, which could lead to HTTP multi-range requests being rejected by servers due to the excessive number of parts. It partially addresses the reported issue.

Reported-by: containers#1928
Signed-off-by: Giuseppe Scrivano <[email protected]>
I've seen some improvements with: #1937 |
Something changed in 5.1.0 and I can not reproduce this issue anymore when I do not have cached layers; it downloads properly now. I can still reproduce the issue in 5.0.3. I will run some tests when there is a cache with the new version + PR.
Since it's been two weeks, this repo has a large number of images with small differences we can test with. In the following tests, I will download two images created one week apart, so that around half of the layers are identical and half are not. Here are the commands:
If we bump the numbers we get the following:
It seems that the partial images feature broke in 5.1.0. I always have 0 bytes saved now ( |
I think you might need #1936 |
Did the zstd:chunked spec change? If so, we would need to start generating images with a new podman to test with. Edit: Indeed, that PR fixed it.
We can't compare the two code paths. The new code does more than just trying to reduce the amount of data requested from the registry. When using partial pulls, we need to use the new code paths even for images that do not share any files with local storage; otherwise, we won't be able to reuse their files for deduplication later. The old code path (the layers pulled without the skipped: info) was just pulling a tarball and extracting it locally. That is it: there is nothing more involved, and it doesn't know what files were pulled or even attempt to deduplicate them. By contrast, the new code path used for partial images not only tries to reduce the amount of data requested from the server, but also calculates/validates the checksum for each file in the archive and attempts to deduplicate it with already existing files in the storage (using reflinks). This process is also necessary to know which files are in the local storage so that future partial pulls can reuse these files. That said, it might be good to investigate avoiding a multirange request when the client is about to request most of its data anyway, so we can avoid the
But nothing stops us from doing this without zstd:chunked either. It's basically what the ostree-container storage is doing: just parsing classic tarballs and computing the ostree sha256 on them. (But that's the ostree-sha256, not the content-sha256, which is again different from the fsverity-sha256 that I'd argue we should be centralizing on, with content-sha256 as an alternative index, perhaps.)
That is what the convert-to-zstd:chunked code path does at the moment. That can be optimized of course, without requiring a conversion, but it would be much more complicated than what we could do in a few lines of code now.
Issue Description
Hi, since we're hearing a lot about the `zstd:chunked` encoding, I ran some tests today, and across the board it appears that the partial images feature is 2-3x slower, regardless of the cached amount (although the tests do not show partially cached layers; it was tested).

When the partial images feature is on, with a registry and client that have more than ~100mbps of bandwidth, image pulling is 2-3x slower: closer to 2x slower when compared to `zstd:chunked` with the feature turned off, and 3x slower compared to `gzip`'d layers.

Steps to reproduce the issue
I ran the following with a 1gbps connection (as confirmed by speedtest.net, ~900mbps) from GitHub, with a ping of 19ms, saved to a gen 4 NVMe SSD, backed by a 7840U-class AMD CPU (8c/16t/16GB RAM).
For context, the image `bluefin-dx` extends `bluefin` with 2 layers, each 1GB (total 2GB).

Describe the results you received
Across dozens of tests, downloading `zstd:chunked` images is 2-3x slower regardless of the cached amount. This is when using the same image, compressed the same way, just by turning the partial images feature off versus on.

Describe the results you expected
`zstd:chunked` images pulled with partial images set to true should have 90% to 100% of the speed of not using the feature. If there is a cache, they should be 2-3x faster (depending on cached files).

podman info output
Podman in a container
No
Privileged Or Rootless
None
Upstream Latest Release
Yes
Additional environment details
No response
Additional information
It seems that the partial images feature is not able to saturate a fast internet connection, unlike the stock behavior of podman, which can. Here are some network screenshots of downloading the same image without a cache (not the one above; same image in both cases; the cache was reset between screenshots).
Stock podman (finishes within 40s and fits within graph)
Partial Images Feature (took 108s so is truncated)