>>884454
I think I'm going to write some proof-of-concept code for the simpler alternative method. None of the code is revolutionary here; it will probably literally use wget to pull the site archive.
I came up with a few extra options today:
-Instead of providing the archive, provide diff files
----Client hashes all binary files, sends file listing and hashes to server (this is going to require a custom client).
----Client sends full text/html files to server.
----Server checks binary hashes for mismatches, missing files, extra files, etc.
----Server runs diff on text/html files, generates diff and patch files
----Server sends back binary hash mismatch data, along with text/html diff and patch files
----Client creates a single tar archive with client's original archive + mismatch file + diff/patch files, hashes it
----Server does the same. The hashes should match, since it now has the same data as the client.
----Server stores this hash for verification.
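
Rough sketch of the hash-listing, mismatch check, and diff steps above, in Python. Nothing here is settled: the function names, the choice of SHA-256, and the TEXT_SUFFIXES set that decides what counts as "text/html" are all placeholders I made up for illustration. The diff uses difflib from the standard library as a stand-in for running diff on the server.

```python
import difflib
import hashlib
from pathlib import Path

# Assumption: these extensions count as "text/html"; everything else is binary.
TEXT_SUFFIXES = {".html", ".htm", ".txt", ".css", ".js"}


def sha256_file(path: Path) -> str:
    """Hash one file in chunks so large binaries are not loaded whole."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def binary_manifest(root: Path) -> dict:
    """Client side: map relative path -> SHA-256 for every non-text file."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix.lower() not in TEXT_SUFFIXES
    }


def compare_manifests(client: dict, server: dict) -> dict:
    """Server side: report mismatched, missing, and extra binary files."""
    shared = client.keys() & server.keys()
    return {
        "mismatch": sorted(k for k in shared if client[k] != server[k]),
        "missing": sorted(server.keys() - client.keys()),  # server has it, client doesn't
        "extra": sorted(client.keys() - server.keys()),    # client has it, server doesn't
    }


def text_diff(client_text: str, server_text: str, name: str) -> str:
    """Server side: unified diff from the client's copy to the server's copy."""
    return "".join(
        difflib.unified_diff(
            client_text.splitlines(keepends=True),
            server_text.splitlines(keepends=True),
            fromfile="client/" + name,
            tofile="server/" + name,
        )
    )
```

The client would send the output of binary_manifest() plus its text/html files; the server would answer with compare_manifests() and one text_diff() per text file. An empty diff string means the two copies of that file are identical.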
Advantages:
-The client can patch the text/html files to generate a fucked copy or not; the archive should be verified either way.
-The server doesn't provide illegal content to the client in case the client is an asshole, which is going to happen.
-The server isn't providing the website files directly, which seems like it should be less DMCA-able.
Disadvantages:
-Even if the hashes aren't, the diff files on the text/html source are probably considered a derivative work. The server only provides them during the exchange, so there's nothing to DMCA but the hashes, but during the exchange itself they might be able to claim there's copyright infringement going on.
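
One catch with the "both sides tar it up and the hashes should match" step in the diff method: a tar archive also records member order, timestamps, and ownership, so two archives with identical contents can still hash differently. A minimal sketch of how both sides could build a byte-identical archive, assuming Python's tarfile module and SHA-256. deterministic_tar() and archive_hash() are names I made up, and pinning the values below is just one way to do it.

```python
import hashlib
import io
import tarfile


def deterministic_tar(files: dict) -> bytes:
    """Build an uncompressed tar whose bytes depend only on names and contents.

    `files` maps archive member name -> file contents as bytes.
    """
    buf = io.BytesIO()
    # Pin the tar format so both sides write the same header layout.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in sorted(files):       # fixed member order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = 0               # drop timestamps
            info.uid = info.gid = 0      # drop numeric ownership
            info.uname = info.gname = "" # drop owner names
            info.mode = 0o644            # fixed permissions
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def archive_hash(files: dict) -> str:
    """SHA-256 of the deterministic archive; this is what the server would store."""
    return hashlib.sha256(deterministic_tar(files)).hexdigest()
```

Client and server would each call archive_hash() on the same set of files (original archive contents + mismatch file + diff/patch files) and compare the hex strings.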
-Instead of providing hashes, sign the archive with GPG
Advantages:
-Hashes can't be DMCA'd because they don't exist. The verification isn't with the site itself but with a 3rd party keyserver.
-If the site goes down, the archives can still be verified.
Disadvantages:
-Server must provide the full archive, signed. If the diff method above is used, it would have to return the full archive back to the client instead of just the diffs/mismatch file.
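
Rough sketch of what the GPG option could look like, driving the gpg command line from Python. Assumptions on my part: a detached, ASCII-armored signature (the idea above only says "sign the archive"), the ".asc" file naming, and the helper names. The key ID and keyserver are whatever you pass in. Note the keyserver only hands out the public key; the actual verify runs locally against the signature file.

```python
import subprocess


def sign_command(archive: str, key_id: str) -> list:
    """Server side: write a detached signature to <archive>.asc."""
    return ["gpg", "--local-user", key_id, "--armor",
            "--output", archive + ".asc", "--detach-sign", archive]


def fetch_key_command(key_id: str, keyserver: str) -> list:
    """Client side: pull the signer's public key from a keyserver."""
    return ["gpg", "--keyserver", keyserver, "--recv-keys", key_id]


def verify_command(archive: str) -> list:
    """Client side: check <archive>.asc against the archive."""
    return ["gpg", "--verify", archive + ".asc", archive]


def run(cmd: list) -> bool:
    """Run a gpg command; True if it exited with status 0."""
    return subprocess.run(cmd).returncode == 0
```

Usage would be something like run(sign_command("site.tar", key_id)) on the server, then run(fetch_key_command(key_id, keyserver)) and run(verify_command("site.tar")) on the client, with gpg installed on both ends.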