Reveal Review Publication

Verifying Project Archives

This article describes several methods for checking the integrity of a Reveal archive. Each method takes more manual effort than the one before it, but each also provides more assurance that the data is intact.

  1. Does every ZIP verify successfully? 

    After downloading all of the ZIPs, use a tool such as 7-Zip to verify the integrity of every file:

    603d31b8553a9.png

    If every ZIP verifies successfully, you should see a message similar to the one below:

    603d31bb6ba5b.png

    Note that it is not a problem if some of the ZIP files contain little to no data, as long as they verify successfully. Because of how the archive process runs, smaller cases can end up with empty ZIP files.
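For a large number of ZIPs, the same verification can be scripted. The following is a minimal sketch using Python's standard `zipfile` module rather than 7-Zip; the function name `verify_zips` and the assumption that all ZIPs sit in one directory with a lowercase `.zip` extension are illustrative, not part of Reveal's tooling.

```python
import zipfile
from pathlib import Path


def verify_zips(zip_dir):
    """Return {zip file name: problem description} for ZIPs that fail verification."""
    problems = {}
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as archive:
                # testzip() reads every member and checks its CRC.
                # It returns the name of the first bad member, or None.
                bad_member = archive.testzip()
        except (zipfile.BadZipFile, OSError) as exc:
            problems[zip_path.name] = f"could not be opened: {exc}"
            continue
        if bad_member is not None:
            problems[zip_path.name] = f"corrupt member: {bad_member}"
    return problems
```

Calling `verify_zips` on the download directory returns an empty dictionary when every ZIP verifies. A valid but empty ZIP is not reported as a problem, which matches the note above.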

  2. Quick content check.

    Extract every ZIP to the same directory. In both WinRAR and 7-Zip you can do this by selecting every ZIP file and choosing Extract Here. After extraction, you should see two top-level directories similar to this:

    603d31bd6276c.png

    The revealdata-s3store-XXXXXX directory contains the raw natives, text, etc. for the case organized as they existed in our storage location. The Logs and Reports directory contains logging information about the archive. Inside that folder you should see a logs directory and a User Information.csv report. 

    As a final quick content check, the revealdata-s3store-XXXXXX directory should contain a single subdirectory whose name is longer than the rest:

    603d31bff3864.png

    This directory should contain a SQL backup .BAK file for the database:

    603d31c2af0f6.png

  3. Log check 

    Back inside the Logs and Reports directory there should be a subdirectory named logs, and inside it a log file named archive-creation.log. Two quick checks of this log help confirm that no errors occurred: 

    1. First, do a case-sensitive search for ERROR. If the archive hit any high-level errors, the log will contain entries with that text. In most cases, though, an archive that hit a high-level error will already be marked as erroneous in Reveal's web front-end:

      603d31c5333f0.png

    2. At the bottom of this log there should be a line that says:

      Total files expected to be archived not accounting for errors 

      In this very small test example, the archive is expected to contain twenty-five files:

      603d31c8279d8.png

      If you go back to the top of the revealdata-s3store-XXXXXX directory and run a file count, the total number of files should match the log's expectation:

      603d31c9db0d4.png
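Both log checks can be gathered in one pass. The sketch below is illustrative only: the function name `quick_log_check` is hypothetical, the log is assumed to be readable as UTF-8, and the expected total is returned as the raw log line for manual comparison, because the exact layout of that line is only shown in the screenshot above.

```python
from pathlib import Path

EXPECTED_TOTAL_TEXT = "Total files expected to be archived not accounting for errors"


def quick_log_check(log_path, store_dir):
    """Return (ERROR lines, last expected-total line or None, local file count)."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    lines = text.splitlines()
    # Case-sensitive search for ERROR, as in check 1.
    error_lines = [line for line in lines if "ERROR" in line]
    # The expected total appears near the bottom of the log, so keep the last match.
    total_lines = [line for line in lines if EXPECTED_TOTAL_TEXT in line]
    total_line = total_lines[-1] if total_lines else None
    # Count regular files at any depth under the extracted store directory.
    file_count = sum(1 for path in Path(store_dir).rglob("*") if path.is_file())
    return error_lines, total_line, file_count
```

Here `store_dir` would be the extracted revealdata-s3store-XXXXXX directory. An empty list of ERROR lines and a file count that matches the number in the returned log line correspond to the two checks passing.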
  4. ZIP hash check 

    This step requires manual or programmatic effort from the verifier. At the time of writing, Reveal has no automated local verification program for archives, and hashes for archive files use the SHA256 algorithm. The screenshots below show files being hashed in PowerShell with the following command: 

    get-filehash -Algorithm SHA256 

    The archive-creation.log file contains hash information for every ZIP and every file successfully archived. Hashes for ZIP files can be pulled by searching for the following text: 

    Ending ZIP file 

    For example: 

    2020-01-11 06:07:48,015 - MainThread - INFO - Ending ZIP file: s3://revealdata-s3store-000500/1000000000/0000000352/Archive/af90dcc1-0aff-4ead-bce4-41f22be29444/62662d9e-7acb-46cc-b4a2-5a8d8bf674d7.zip with hash: 8739c76e681f900923b900c9df0ef75cf421d39cabb54650c4b9ad19b6a76d85 

    In the location where you stored the ZIP files, the file 62662d9e-7acb-46cc-b4a2-5a8d8bf674d7.zip should have the SHA256 hash 8739c76e681f900923b900c9df0ef75cf421d39cabb54650c4b9ad19b6a76d85:

    603d31cbb68e1.png

    Checking each ZIP hash in this manner should, in effect, also verify the integrity of every document the ZIP contains.
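The ZIP hash comparison can be automated along these lines. This is a sketch, not a Reveal-provided tool: the regular expression is based only on the example log line above, the function names are hypothetical, and all downloaded ZIPs are assumed to sit in a single directory under the file names shown in the log.

```python
import hashlib
import re
from pathlib import Path

# Based on the example line:
# ... - INFO - Ending ZIP file: s3://.../<name>.zip with hash: <64 hex characters>
ZIP_LINE = re.compile(r"Ending ZIP file: (\S+) with hash: ([0-9a-fA-F]{64})")


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large ZIPs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_zip_hashes(log_path, zip_dir):
    """Return {zip name: "ok" | "mismatch" | "missing"} for each ZIP named in the log."""
    results = {}
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = ZIP_LINE.search(line)
            if not match:
                continue
            uri, expected = match.groups()
            name = uri.rsplit("/", 1)[-1]  # file name part of the s3:// location
            local = Path(zip_dir) / name
            if not local.is_file():
                results[name] = "missing"
            elif sha256_of(local) == expected.lower():
                results[name] = "ok"
            else:
                results[name] = "mismatch"
    return results
```

Any entry other than "ok" in the returned dictionary points to a ZIP that should be downloaded again or investigated.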

  5. Individual file hash check 

    To check hashes at the individual file level, search for the following text in the archive-creation.log file: 

    File archived successfully 

    For example: 

    2020-01-11 06:07:33,193 - MainThread - INFO - File archived successfully:    s3://revealdata-s3store-000500/c99e413f-7bda-42f8-a0a8-de5bf9ba48d0/corp_99999_00000Enron.bak    1414a34c7578ba7b8433b6362a52369a42034ddd5ff1fb70a0384804dc85bd4e 

    In this example, the file located at revealdata-s3store-000500/c99e413f-7bda-42f8-a0a8-de5bf9ba48d0/corp_99999_00000Enron.bak is expected to have the SHA256 hash 1414a34c7578ba7b8433b6362a52369a42034ddd5ff1fb70a0384804dc85bd4e:

    Verifying_Project_Archive_Hash2.png

    In this manner you can parse the archive-creation.log file and verify the integrity of every file in the archive via SHA256 hash.
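A minimal sketch of that parsing step follows. It assumes that each "File archived successfully" line has the layout shown in the example above (the s3:// location, whitespace, then a 64-character hexadecimal hash at the end of the line) and that s3://bucket/path extracts to bucket/path under the directory where the ZIPs were extracted. The function name `check_file_hashes` is hypothetical.

```python
import hashlib
import re
from pathlib import Path

# Location and hash are separated by whitespace; the lazy group lets the
# location itself contain spaces.
FILE_LINE = re.compile(
    r"File archived successfully:\s+s3://(.+?)\s+([0-9a-fA-F]{64})\s*$"
)


def check_file_hashes(log_path, extract_root):
    """Return a list of (path from the log, "ok" | "mismatch" | "missing")."""
    results = []
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = FILE_LINE.search(line)
            if not match:
                continue
            key, expected = match.groups()
            # Assumed mapping: s3://bucket/a/b.ext -> <extract_root>/bucket/a/b.ext
            local = Path(extract_root).joinpath(*key.split("/"))
            if not local.is_file():
                status = "missing"
            else:
                digest = hashlib.sha256()
                with open(local, "rb") as handle:
                    # Read in 1 MiB chunks to keep memory use flat.
                    for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                        digest.update(chunk)
                matches = digest.hexdigest() == expected.lower()
                status = "ok" if matches else "mismatch"
            results.append((key, status))
    return results
```

Filtering the returned list for any status other than "ok" gives the files that need a closer look.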