Save entire Web page to IPFS #91
Comments
Sounds good.
Update: the 2read extension is a great PoC!
@victorbjelkholm do you think websaver (or the parts of it responsible for saving the DOM) could be reused for this?
@lidel for sure! I think the hard part is replicating the DOM into a string that can be used to render it again. Websaver kind of works, but it's held together with hacks, as I couldn't find a clean way of serializing the DOM. Best way I found was this:
Then the second part (edit: this is actually what happens first, then the serialization happens) is going through all

I also hit another tricky issue that I'm unsure how to solve. The current implementation is naive in that it assumes a URL mostly returns the same content for all users, which is not true. Sometimes web applications render data into JS files (hurr) that are served as normal scripts.
I know the pain: we can't even use MHTML, as it is not supported by browser vendors (anymore). My hope is that in the long term something like webpackage will gain adoption. It aims to address (among other things) the website snapshotting use case in a safe and reproducible manner that is aware of HTTP semantics: webpackage: Save and share a web page (Use Case) sounds super relevant to what we want as the endgame here and for ipfs/in-web-browsers#94 in general. But for now, a rough snapshot via the inlining+serialization you described, along with a screenshot, could cover ~80% of use cases and is something we could do with today's tools. @victorb Is the websaver repo available somewhere, or is it just a quick hack distilled into the snippet above?
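For readers following along, the inlining half of that "inlining+serialization" idea can be sketched roughly as below. This is only an illustration in Python, not websaver's code: it assumes the page has already been serialized to an HTML string and that the referenced assets were fetched into a dictionary beforehand. The names `to_data_uri`, `inline_assets`, and the `assets` mapping are hypothetical, and a regex is a shortcut that a real implementation would likely replace with a proper DOM walk.

```python
import base64
import re


def to_data_uri(payload: bytes, mime: str) -> str:
    """Encode raw bytes as a base64 data: URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Matches src="..." or src='...' but not attributes such as data-src.
SRC_ATTR = re.compile(r'''(?<![\w-])src=(["'])(.*?)\1''')


def inline_assets(html: str, assets: dict[str, tuple[bytes, str]]) -> str:
    """Replace every src URL present in `assets` with an inline data: URI.

    `assets` maps URL -> (bytes, MIME type). URLs that were not fetched
    are left untouched, so the snapshot degrades instead of failing.
    """
    def swap(match: re.Match) -> str:
        quote, url = match.group(1), match.group(2)
        if url not in assets:
            return match.group(0)
        payload, mime = assets[url]
        return f"src={quote}{to_data_uri(payload, mime)}{quote}"

    return SRC_ATTR.sub(swap, html)


# Hypothetical page with one fetched image and one that could not be fetched.
page = '<html><body><img src="/logo.png"><img src="/missing.png"></body></html>'
snapshot = inline_assets(page, {"/logo.png": (b"\x89PNG", "image/png")})
```

The resulting `snapshot` string is a single self-contained document that could be added to IPFS as one file. It does not address the per-user content problem mentioned earlier in the thread.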
Reopening: I believe "Add to IPFS" via right-click on a page will only save the HTML alone. Mirroring a full page with assets (images, CSS and JS) requires additional work integrating something like websaver by @victorb.
Is this issue the reason this import of the PrivacyTools homepage looks broken? I took it as a demo and was going to open a duplicate or ask about it as part of #850, as I stored three pages to IPFS and only encountered this issue with that one.
Bumping this due to its being mentioned again in #948. |
@Gozala do you think there are lessons from https://github.com/inkandswitch/xcrpt that we could apply here without spinning our wheels too much? Sounds like a similar problem space and a really useful feature to have. |
Thanks for asking @lidel, indeed the goals are very similar here. I think there are a couple of things I'd share from building that:
One other, more meta point I'd like to make is that I think it would be better to see more interoperable tools working in concert with each other than attempting to build a whole suite of tools into one product, in this case ipfs-companion. That is to suggest it would be more desirable to seek out other projects in the space that are already working on archiving pages in some form and figure out how they could collectively enhance each other. E.g. Brave's approach of suggesting to enable an IPFS extension when navigating to a resource on IPFS is a good example where functionality may not be built in, but there is a good way to add it. Hope some of this is helpful here.
@Gozala thanks for the mention! Here's just two examples:
This archive (of Parametric Press Issue 02) is ~100MB, but everything is loaded on demand. You can link to specific pages in the archive and search queries, and text search
The archives are serialized as standard WARCs or WACZ (a new ZIP-based format that supports random access). I think this approach works much better than saving a static DOM, which is generally limited to pages that are self-contained or fully static, and is not necessarily simpler to create. If you really only want static content, another extension that works really well for that is: https://github.com/gildas-lormeau/SingleFile Depending on what the goals of this issue are, I suppose you could close this as resolved :) I'll have more updates on this work soon, and would be happy to chat more about collaborations if anyone is interested!
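To make the "random access" point concrete: a ZIP file carries a central directory, so a reader can locate and read one member without unpacking the rest. The sketch below shows that property using only Python's standard `zipfile` module. It does not reproduce the actual WACZ layout; the member names and the helpers `build_archive` and `read_member` are made up for illustration.

```python
import io
import zipfile


def build_archive(pages: dict[str, bytes]) -> bytes:
    """Pack several captured resources into one in-memory ZIP file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, body in pages.items():
            archive.writestr(name, body)
    return buffer.getvalue()


def read_member(archive_bytes: bytes, name: str) -> bytes:
    """Read a single member; the other members are never decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(name)


# Hypothetical member names, not the real WACZ directory structure.
archive_bytes = build_archive({
    "pages/index.html": b"<h1>Home</h1>",
    "pages/about.html": b"<h1>About</h1>",
})
about = read_member(archive_bytes, "pages/about.html")
```

With a seekable or range-capable source, the same lookup only needs the central directory plus the bytes of the requested member, which is what makes on-demand loading of a large archive practical.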
@ikreymer that is really impressive work, I'm blown away!
I think it really depends on what the goals are. That said, your work got me thinking that it may indeed be better to capture all you can (like you do) and just have different ways to view that archive. That way you could view a stripped-down version like markdown or a full replica.
That is exciting. I think it might be great to demo your work on one of the community calls. @autonome might be a good person to chat with about possible collaborations
@ikreymer This could be a great lightning talk at one of the upcoming IPFS virtual meetups 😊 If you'd be interested in talking, here's the speaker form: https://protocollabs.typeform.com/to/hLGfKhxn |
Thanks!
Yes, I agree, it does depend on what the goals are. And starting with a 'high-fidelity' archive, you can always 'downsample' to just getting the static DOM later, or a more limited view, etc...
Thanks, I'll sign up for a future call, would be happy to share this!
This has been on my mind once again, and I recalled how I settled on DOM snapshots as the preferred option. We were exploring an alternative medium for the web in which we fused the notions of browser history and tabs: https://www.freecodecamp.org/news/lossless-web-navigation-with-trails-9cd48c0abb56/ We ended up wanting to unload unused tabs and replace them with card renderings that could, on switch, be brought back to life in the exact same state the tab was in after it loaded, without a Safari-like flush-and-reload that loses scroll position, etc. Saving a frozen, JS-less DOM provided a reasonable experience; we would just overlay it with a control that told the user it was a snapshot of the page from a specific date and allowed them to go to the current version. That is to say, capturing an archive will not be able to fulfill that use case. So ideally, I think a tool would capture a DOM snapshot along with a web archive and provide a way to go from snapshot to archive to current page and vice versa.
Thanks for sharing these! Yeah, I think it comes down to whether the 'JS-less DOM' is a reasonable experience or not. But the trails idea and the various spatial navigations are definitely cool ideas, would be happy to chat about it at some point! One other thing I've experimented with is saving the window.history stack and then recreating it via history.pushState. On a few sites, this can actually give you a way to replay a dynamic page that was reached through several history navigations, and allow you to 'go back'. But it only works well on sites that 'play nice' with the History API. Hope to revisit this idea at some point.
I wanted to finally share: we've just launched the ArchiveWeb.page Chrome extension, which allows for archiving any page in a Chrome-based browser (via the debug protocol, to get full fidelity). The ArchiveWeb.page extension also includes experimental IPFS support, so users can archive a page (or several), then share via IPFS, and send a link to load via replayweb.page or via a regular. Here's more info in our guide. There was some work to optimize the system for on-demand loading of a web archive. A web archive may get fairly large, and to avoid pulling all the blocks in the multihash over to the preload node at once when loading via ReplayWeb.page, the system tries to pull only the blocks needed for a particular page. The main issue for now is that if the websocket connection to the preload node is lost and is not reestablished, the sharing stops (I understand this is being addressed via libp2p/js-libp2p#744). Also looking to see how this can work better in Brave using the native IPFS node. I'll sign up for the virtual meetup and am happy to talk more about this work!
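As a rough illustration of "pull only the blocks needed for a particular page", the sketch below splits a byte string into fixed-size, hash-addressed blocks and then serves a byte range by fetching just the blocks that overlap it. It is a toy model in Python under stated assumptions: it does not use IPFS's real chunker, CID format, or DAG layout, and `split_into_blocks`, `read_range`, and `fetch` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Return (ordered list of block hashes, hash -> block store)."""
    order, store = [], {}
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        key = hashlib.sha256(block).hexdigest()
        store[key] = block
        order.append(key)
    return order, store


def read_range(order, fetch, offset: int, length: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Fetch only the blocks overlapping [offset, offset + length)."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    joined = b"".join(fetch(key) for key in order[first:last + 1])
    skip = offset - first * block_size
    return joined[skip:skip + length]


data = b"0123456789abcdefghij"  # stands in for a large web archive
order, store = split_into_blocks(data)

fetched = []  # records which blocks were actually requested


def fetch(key: str) -> bytes:
    fetched.append(key)
    return store[key]


piece = read_range(order, fetch, offset=6, length=5)
```

Here the archive is split into five blocks, but reading five bytes at offset 6 touches only two of them. Combined with an index of where each page's records live in the archive, that is the general shape of loading one page without transferring the whole file.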
Thanks for the update! If you'd be interested in presenting at one of the virtual meetups, send a note to [email protected] with some details about your extension and what you'd specifically be interested in presenting. Also, consider submitting to Awesome IPFS? https://github.com/ipfs/awesome-ipfs/blob/master/CONTRIBUTING.md |
For drive-by readers, this video was recorded at a recent IPFS Community Meetup and gives a good overview and demo of the system created by @ikreymer. I think for the time being we will aim at making single-page snapshots created by Companion a bit more useful (e.g. inline images and CSS so things look decent) so Firefox users have something, but for advanced archiving we will point at https://replayweb.page + a separate extension. UI TBD.
Degraded user experience example when the user is just shown raw HTML content upon trying to "Import to IPFS" a web page.
(This is a part of meta-issue: Mirroring Web to IPFS tracked at ipfs/in-web-browsers#94)
in addition to #59