Verifying Project Archives
This article describes several methods for checking the integrity of a Reveal archive. Each method takes more manual effort than the one before it, but each also provides more assurance that the data is intact.
Does every ZIP verify successfully?
After downloading all of the ZIP files, it is worth using a tool such as 7-Zip to verify the integrity of every file:
If every ZIP verifies successfully, the tool should report that no errors were found.
Note that it is not a problem if some of the ZIP files contain little or no data, as long as they verify successfully. Because of how the archive process runs, smaller cases can end up with empty ZIP files.
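For a large number of ZIP files, the same check can be scripted. The following is a minimal sketch rather than a Reveal-provided tool: it uses Python's standard zipfile module, and the function name verify_zips is hypothetical. It assumes all downloaded ZIP files sit in one directory.

```python
import zipfile
from pathlib import Path


def verify_zips(directory):
    """Map each ZIP file name in `directory` to None (verified) or a problem description."""
    results = {}
    for path in sorted(Path(directory).glob("*.zip")):
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip() reads every member and checks its CRC.
                # It returns the name of the first bad member, or None.
                bad_member = archive.testzip()
        except zipfile.BadZipFile as exc:
            results[path.name] = f"not a valid ZIP: {exc}"
            continue
        if bad_member is None:
            results[path.name] = None  # an empty but valid ZIP also lands here
        else:
            results[path.name] = f"corrupt member: {bad_member}"
    return results
```

Calling verify_zips on the download directory and printing any entry whose value is not None gives a quick list of ZIP files that need to be downloaded again.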
Quick content check
Extract every ZIP to the same directory. In both WinRAR and 7-Zip you can do this by highlighting every ZIP file and selecting Extract Here. After extraction, you should see two top-level directories: revealdata-s3store-XXXXXX and Logs and Reports.
The revealdata-s3store-XXXXXX directory contains the raw natives, text, etc. for the case, organized as they existed in our storage location. The Logs and Reports directory contains logging information about the archive. Inside that directory you should see a logs directory and a User Information.csv report.
As a final quick content check, the revealdata-s3store-XXXXXX directory should contain a single subdirectory whose name is longer than the rest.
This subdirectory should contain a SQL backup .BAK file for the database.
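The layout described in this section can also be checked with a short script. This is a sketch under assumptions: the function name quick_content_check is hypothetical, and as a simplification it looks for a .bak file anywhere under the revealdata-s3store-XXXXXX directory rather than only in the longest-named subdirectory.

```python
from pathlib import Path


def quick_content_check(extract_root):
    """Return a list of problems found in the extracted archive layout (empty if none)."""
    root = Path(extract_root)
    problems = []

    # Expect exactly one top-level revealdata-s3store-XXXXXX directory.
    store_dirs = [p for p in root.iterdir()
                  if p.is_dir() and p.name.startswith("revealdata-s3store-")]
    if len(store_dirs) != 1:
        problems.append(
            f"expected one revealdata-s3store-* directory, found {len(store_dirs)}")

    # Expect the logs directory and the user report under Logs and Reports.
    reports = root / "Logs and Reports"
    if not (reports / "logs").is_dir():
        problems.append("missing 'Logs and Reports/logs' directory")
    if not (reports / "User Information.csv").is_file():
        problems.append("missing 'Logs and Reports/User Information.csv'")

    # Simplification: accept a .bak file at any depth under the store directory.
    for store in store_dirs:
        backups = [p for p in store.rglob("*")
                   if p.is_file() and p.suffix.lower() == ".bak"]
        if not backups:
            problems.append(f"no .bak file found under {store.name}")
    return problems
```

An empty list means the extracted archive has the expected top-level shape; it says nothing yet about whether individual files are intact.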
Log check
Back in the Logs and Reports directory there should be a subdirectory named logs, which should contain a log file named archive-creation.log. Two quick checks against this log help confirm that no errors occurred:
First, do a case-sensitive search for ERROR. If the archive hit any high-level errors, you should see entries containing that text. However, if a high-level error like this was thrown, the archive will most likely already be marked as erroneous in Reveal's web front-end.
Second, at the bottom of this log there should be a line that says:
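The ERROR search can be done in any text editor, but for very large logs a script may be easier. The sketch below is illustrative only: the function name find_error_lines is hypothetical, and it assumes the log can be read as UTF-8 text (undecodable bytes are replaced rather than raising an error).

```python
def find_error_lines(log_path):
    """Return (line_number, text) for every log line containing 'ERROR' (case-sensitive)."""
    hits = []
    # Assumption: the log is UTF-8 text; bad bytes are replaced instead of failing.
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if "ERROR" in line:  # the `in` operator is case-sensitive
                hits.append((number, line.rstrip("\n")))
    return hits
```

An empty result means the case-sensitive search found nothing; any hits are returned with their line numbers so they can be located in the log.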
Total files expected to be archived not accounting for errors
In this very small test example, the archive is expected to contain twenty-five files:
If you go back to the top of the revealdata-s3store-XXXXXX directory and run a file count, the total number of files should match the count given in the log.
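The comparison between the logged total and the actual file count can be sketched as follows. The exact layout of the total line is not shown in this article, so the code assumes the count is the first integer that appears after the marker text on the same line; adjust the parsing if your log differs. The function names are hypothetical.

```python
import os
import re

MARKER = "Total files expected to be archived not accounting for errors"


def expected_file_count(log_path):
    """Return the expected total from the log, or None if the marker line is absent."""
    expected = None
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if MARKER in line:
                # Assumption: the count is the first integer after the marker text.
                match = re.search(r"\d+", line.split(MARKER, 1)[1])
                if match:
                    expected = int(match.group())
    return expected  # the last matching line wins, since the total is at the bottom


def count_files(directory):
    """Count regular files at any depth under `directory`."""
    return sum(len(files) for _, _, files in os.walk(directory))
```

If expected_file_count for archive-creation.log and count_files for the revealdata-s3store-XXXXXX directory return the same number, this check passes.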
ZIP hash check
This last step requires manual or programmatic effort from the verifier. At the time of writing, Reveal has no automated local verification program for archives, and hashes for archive files use the SHA256 algorithm. The screenshots below show files being hashed in PowerShell with the following command:
Get-FileHash -Path <file> -Algorithm SHA256
The archive-creation.log file contains hash information for every ZIP and every file successfully archived. Hashes for ZIP files can be pulled by searching for the following text:
Ending ZIP file
For example:
2020-01-11 06:07:48,015 - MainThread - INFO - Ending ZIP file: s3://revealdata-s3store-000500/1000000000/0000000352/Archive/af90dcc1-0aff-4ead-bce4-41f22be29444/62662d9e-7acb-46cc-b4a2-5a8d8bf674d7.zip with hash: 8739c76e681f900923b900c9df0ef75cf421d39cabb54650c4b9ad19b6a76d85
Back in the location where you are storing the ZIP files, the file 62662d9e-7acb-46cc-b4a2-5a8d8bf674d7.zip should have the SHA256 hash 8739c76e681f900923b900c9df0ef75cf421d39cabb54650c4b9ad19b6a76d85:
Because a ZIP file's hash covers its entire contents, checking each ZIP hash in this manner should also verify the integrity of every document it contains.
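Checking many ZIP files by hand is tedious, so the comparison can be scripted. The following sketch is not a Reveal tool: it assumes log lines follow the "Ending ZIP file: ... with hash: ..." format shown in the example above, and that the ZIP files are stored locally under their original file names. All function names are hypothetical.

```python
import hashlib
import re
from pathlib import Path

# Assumption: matches the "Ending ZIP file" log line format shown in the example.
ZIP_LINE = re.compile(r"Ending ZIP file: (.+) with hash: ([0-9a-fA-F]{64})")


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large ZIP files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_zip_hashes(log_path):
    """Map each ZIP file name found in the log to its expected SHA256 hash."""
    expected = {}
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = ZIP_LINE.search(line)
            if match:
                # Keep only the file name, dropping the s3:// prefix and directories.
                name = match.group(1).rsplit("/", 1)[-1]
                expected[name] = match.group(2).lower()
    return expected


def check_zip_hashes(log_path, zip_directory):
    """Map each ZIP name in the log to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, expected in expected_zip_hashes(log_path).items():
        local = Path(zip_directory) / name
        if not local.is_file():
            results[name] = "missing"
        elif sha256_of(local) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Any entry that is not 'ok' points to a ZIP file that should be downloaded again or investigated further.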
Individual file hash check
To check hashes at the individual file level, search for the following text in the archive-creation.log file:
File archived successfully
For example:
2020-01-11 06:07:33,193 - MainThread - INFO - File archived successfully: s3://revealdata-s3store-000500/c99e413f-7bda-42f8-a0a8-de5bf9ba48d0/corp_99999_00000Enron.bak 1414a34c7578ba7b8433b6362a52369a42034ddd5ff1fb70a0384804dc85bd4e
In this example, the file located at revealdata-s3store-000500/c99e413f-7bda-42f8-a0a8-de5bf9ba48d0/corp_99999_00000Enron.bak is expected to have the SHA256 hash 1414a34c7578ba7b8433b6362a52369a42034ddd5ff1fb70a0384804dc85bd4e:
In this manner you can parse the archive-creation.log file and verify the integrity of every file in the archive via SHA256 hash.
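A sketch of that parsing is shown below. It assumes each "File archived successfully" line ends with the s3:// path, a space, and a 64-character hash, as in the example above, and that the extracted files sit under the extraction directory at the same relative path with the s3:// prefix removed. The function names are hypothetical.

```python
import hashlib
import re
from pathlib import Path

# Assumption: the line ends with "<s3 path> <64 hex characters>", as in the example.
FILE_LINE = re.compile(
    r"File archived successfully: s3://(.+) ([0-9a-fA-F]{64})\s*$")


def expected_file_hashes(log_path):
    """Map each archived file's relative path to its expected SHA256 hash."""
    expected = {}
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = FILE_LINE.search(line)
            if match:
                expected[match.group(1)] = match.group(2).lower()
    return expected


def failed_files(log_path, extract_root):
    """Return (relative_path, reason) for every file that is missing or mismatched."""
    failures = []
    for relative, expected in expected_file_hashes(log_path).items():
        # Assumption: local layout mirrors the s3 path below the extraction root.
        local = Path(extract_root).joinpath(*relative.split("/"))
        if not local.is_file():
            failures.append((relative, "missing"))
            continue
        digest = hashlib.sha256()
        with open(local, "rb") as handle:
            # Read in 1 MiB chunks so large natives do not need to fit in memory.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            failures.append((relative, "mismatch"))
    return failures
```

An empty list from failed_files means every file recorded in the log was found locally with a matching SHA256 hash.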