

Reveal Cloud AI Installation

1 General Description and Setup

1.1 Overall Description

Reveal's Cloud AI platform is a middleware layer that handles communication with AWS for cloud analytics functions. At the time of writing, only integration with AWS analytics is supported.

The system comprises multiple parts, which may or may not be present depending on the makeup of the Reveal architecture accessing the system. It generally consists of the following:

  • Orchestration This is a centralized server that maintains the following:

    1. Mongo database

    2. API on top of the mongo database

    3. Several services that perform minimal work, e.g., watching for AWS jobs to finish.

  • Archive Worker This is a server that is turned on when there is archiving work to perform. Use of this functionality requires all data for your Reveal site to be stored in S3.

  • Transcription Worker This is a server that exists to perform heavy-lifting transcoding work on audio/video files.

Although this document describes these three as separate components, there is no reason they must be separate; they could all reside on the same server, depending on the desired setup and server load.

All servers in this system are intended to run on CentOS-based Linux operating systems. There is no particular reason they cannot run on Windows, as the platform code is Python-based; however, the documentation herein assumes CentOS Linux 8.4.2105.

Additionally, if archiving is to be put in place within the system, an extra service that performs database backups must be installed on the Review database server. Ignore all archiving-related instructions in this document if archiving is not being set up.

1.2 Setup Checklist

  1. AWS Account Provisioned

  2. Linux Servers Provisioned (CentOS 8.4.2105)

  3. IAM Setup Items Done

  4. KMS Key Setup Done

  5. Network Security Allowed

  6. Orchestration Server Setup Performed

  7. Transcription Server Setup Performed

  8. Review Reconfiguration Performed

1.3 Storage

Shorter description:

  • Orchestration 50GB root and 1TB storage.

  • Archive Worker 50GB root.

  • Transcription Worker 50GB root and 1TB storage.

Longer description:

  • Orchestration This server should have free space equal to at least twice the size of the largest image labeling job you intend to run. Part of the work the Orchestration server performs is syncing down images for image labeling jobs and then converting those files to AWS-usable formats, potentially taking up twice the space for every image you submit for labeling at a single time.

  • Archive Worker This server needs very little storage space; around a 50GB root should be plenty. The only thing that takes up space on the archive worker is archiving logs, which are cleaned after every job.

  • Transcription Worker This server should have free space equal to at least twice the size of the largest transcription job you intend to run. As with image labeling, transcription jobs may require media transcoding, which creates an extra file for every audio/video file you send, as sketched below.
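
For intuition, a minimal sketch of such a transcoding step follows (hypothetical; the actual worker logic ships inside the Cloud AI package, and this assumes ffmpeg is on the PATH). It re-encodes a media file to 16 kHz mono WAV, a format Amazon Transcribe accepts, producing the second copy that drives the 2x space requirement:

    import subprocess
    from pathlib import Path

    def transcode_for_transcribe(src: Path, workdir: Path) -> Path:
        """Re-encode a media file to 16 kHz mono WAV. The output is a second
        copy of the media, which is why the worker needs roughly twice the
        job size in free space."""
        dst = workdir / (src.stem + ".wav")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)],
            check=True,
        )
        return dst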

1.4 Resources

  • Orchestration In our production environment this is an AWS t2.medium (2 vCPU, 4 GB RAM).

  • Archive Worker In our production environment this is an AWS t2.2xlarge (8 vCPU, 32 GB RAM).

  • Transcription Worker In our production environment this is an AWS t2.2xlarge (8 vCPU, 32 GB RAM).

1.5 Staging S3 bucket

The Cloud AI system uses an S3 bucket for staging purposes. This is necessary as AWS services require data to be stored locally in S3 for analysis to be performed. In this document we use the bucket name 'revealdata-analytics-temp-us'. Replace this in all policies with the name of your staging bucket.

The Cloud AI services do not automatically clean this S3 staging bucket. We suggest setting up a cleanup policy on the bucket to ensure that old data are properly deleted.
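
Such a cleanup policy can be applied, for example, as an S3 lifecycle rule. Below is a minimal sketch using boto3 (the 30-day retention is an example value; adjust the bucket name and retention to match your environment):

    import boto3

    # Expire staging objects after 30 days and abort stale multipart uploads.
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="revealdata-analytics-temp-us",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-staging-data",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # apply to the whole bucket
                    "Expiration": {"Days": 30},
                    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
                }
            ]
        },
    )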

If you want to maintain data in a particular region, be careful to set up this staging bucket in the specific AWS region the data reside in.

1.6 Networking

By Source (outbound):

Source                 Destination     Port    Reason
Archive Worker         Orchestration   27017   Queue Monitor and Queries
Archive Worker         Orchestration   5000    API Queries
Archive Worker         SQL Server      1433    Query databases
Archive Worker         SQL Server      8500    Database Backup API Queries
Archive Worker         AWS             443     Allow communication to AWS resources
Reveal Web Server      Orchestration   443     API Queries to add work
Orchestration          AWS             443     Allow communication to AWS resources
SQL Server             AWS             443     Database backup upload (archiving only)
Transcription Worker   AWS             443     Allow communication to AWS resources
Transcription Worker   Orchestration   27017   Queue Monitor and Queries

By Destination (inbound):

Destination     Source                    Port    Reason
Orchestration   Archive Worker            27017   Queue Monitor and Queries
Orchestration   Archive Worker            5000    API Queries
Orchestration   Reveal Web Server         5000    API Queries to add work
Orchestration   Transcription Worker      27017   Queue queries
SQL Server      Archive Worker            1433    Query databases
SQL Server      Archive Worker            8500    Database Backup API Queries
AWS             Archive Worker            443     Allow communication to AWS resources
AWS             Transcription Worker      443     Allow communication to AWS resources
AWS             SQL Server                443     Database Backup upload (archiving only)
AWS             Orchestration (Monitor)   443     Allow communication to AWS resources

Review also has the following networking requirements to link into this system:

  1. Review Index Batch will need to query the Orchestration server via TCP 5000 and communicate with S3.

  2. The Review Storage Service will need to be able to communicate with S3 in order to access finished data from the staging bucket.

  3. TCP 443 should be allowed outbound for all intended interactions with AWS Analytics (S3, Rekognition, etc.).
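
To sanity-check these rules from a given server, a small script can attempt TCP connections to that server's destinations from the tables above. The hostnames below are placeholders; substitute your own:

    import socket

    CHECKS = [
        ("orchestration.internal", 27017),  # MongoDB: queue monitor and queries
        ("orchestration.internal", 5000),   # Orchestration API
        ("sqlserver.internal", 1433),       # SQL Server: query databases
        ("sqlserver.internal", 8500),       # Database backup API (archiving only)
        ("s3.amazonaws.com", 443),          # AWS resources
    ]

    for host, port in CHECKS:
        try:
            with socket.create_connection((host, port), timeout=5):
                print(f"OK    {host}:{port}")
        except OSError as exc:
            print(f"FAIL  {host}:{port} ({exc})")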

1.7 AWS Services

The Cloud AI platform makes use of the following AWS services:

  1. Rekognition (image labeling)

  2. Comprehend (language detection in transcription)

  3. Translate

  4. Transcribe

You must verify that these services exist in the AWS region you are setting up in. Specifically, Translate's Bulk functionality is not available in every region. Generally speaking, information about the availability of AWS services can be found here:

https://docs.aws.amazon.com/general/latest/gr/aws-service-information.html

Just browse to the desired service and check the API availability in your target region. Below is a list of the services we use:

https://docs.aws.amazon.com/general/latest/gr/rekognition.html

https://docs.aws.amazon.com/general/latest/gr/comprehend.html

https://docs.aws.amazon.com/general/latest/gr/transcribe.html

https://docs.aws.amazon.com/general/latest/gr/translate-service.html

https://docs.aws.amazon.com/general/latest/gr/s3.html

https://docs.aws.amazon.com/general/latest/gr/ec2-service.html

For additional confirmation, see the below information to visually confirm availability:

Rekognition https://console.aws.amazon.com/rekognition/home?region=us-east-1#/

All necessary Rekognition functions should be available in all regions; however, you can verify by checking that Object and Scene Detection is available.

Comprehend https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#welcome

Check the left side for 'Analysis Jobs'. This is used for automatic language detection for translation. Be aware that Amazon charges relatively more for this function than for other services. It can be manually disabled in the Cloud AI system on request.

Translation https://console.aws.amazon.com/translate/home?region=us-east-1#translation

Check the left side of the console; if 'Batch Translation' appears, then it should be available in the selected region.

Transcription https://console.aws.amazon.com/transcribe/home?region=us-east-1#jobs

Check the left-hand panel to ensure 'Transcription Jobs' is available.

Each of the above functions can be disabled depending on whether the region you are operating in supports it.
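
As a programmatic cross-check of the console steps above, the AWS SDK's endpoint metadata can be queried with boto3. Note this reflects SDK region listings rather than feature-level availability (it cannot detect, for example, whether Batch Translation is offered), so the console checks remain authoritative:

    import boto3

    REGION = "us-east-1"  # replace with your target region
    session = boto3.session.Session()

    for service in ("rekognition", "comprehend", "transcribe", "translate", "s3"):
        regions = session.get_available_regions(service)
        status = "available" if REGION in regions else "NOT listed"
        print(f"{service}: {status} in {REGION}")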

1.8 IAM Items

Cloud AI makes use of a few different IAM Users/Roles:

  • Review Full Access User Used by Review to place data in the staging bucket. If you have already set up S3 for Review then this user will already be set up. Just make sure you add in the read/write permissions described in Section 3.2.

  • Cloud AI User Used by the cloud AI system to call specific AWS functions. The required permissions block is described in Section 3.3.

  • AWS Services Role This role will be provided to various AWS services to allow them to access the staging bucket. The required role is described in Section 3.4.
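
Before wiring any of these credentials into Review or the Cloud AI config files, it can be worth confirming which principal they actually resolve to. A minimal check with boto3:

    import boto3

    # Print the account and ARN of the configured credentials; the ARN should
    # match the expected user (e.g. .../user/svc-cloud-api-us).
    identity = boto3.client("sts").get_caller_identity()
    print("Account:", identity["Account"])
    print("ARN:", identity["Arn"])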

1.9 Data Security

1.9.1 AWS use of input data

One item to note is that, by default, AWS uses the data you provide to its analytics functions as a basis to improve its systems. For example, see this document on the Translate service:

https://aws.amazon.com/translate/faqs/

Under data privacy, see the question 'Are text inputs processed by Amazon Translate stored, and how are they used by AWS?' As mentioned there, you must contact AWS to opt out of these sharing policies.

1.9.2 Data encryption

Both image labeling and transcription allow the use of KMS Keys for the encryption of input data. KMS is usually desirable as the data sent to AWS are encrypted by keys specific to your account/client.

Bulk translation, however, does not allow the use of KMS at the time of writing. This is a limitation on AWS's side. To that end, we encrypt data with AWS's generic server-side encryption when sending data to translation. This does not provide the same features as KMS, but it does still ensure encryption at rest.
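
To illustrate the difference, the two encryption modes map to the following S3 upload parameters. This is a sketch only; the actual uploads are performed by Review and the Cloud AI services, and the object keys and key alias below are hypothetical:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "revealdata-analytics-temp-us"  # your staging bucket

    # Image labeling / transcription inputs: customer-managed KMS key.
    s3.put_object(
        Bucket=BUCKET,
        Key="transcription/input/sample.wav",
        Body=b"...",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/your-cloud-ai-key",
    )

    # Bulk translation inputs: generic server-side encryption (SSE-S3),
    # since bulk translation does not accept KMS-encrypted input.
    s3.put_object(
        Bucket=BUCKET,
        Key="translation/input/sample.txt",
        Body=b"...",
        ServerSideEncryption="AES256",
    )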

1.10 On Prem Installation Versus SaaS Considerations

Note that Cloud AI setup is required to enable some features of Review. Archiving, the uploader, and shareable links rely upon S3, which is why these items are not supported in on-premises installations. The lists below cover features that either require extra configuration or are unavailable in on-premises installations.

Requires Extra Configuration/AWS Account

  • Translation

  • Transcription

  • Image Labeling  

Not Available

  • Easy archiving (case management screen)

  • Web Uploader

  • E-mailing a link to a completed export (new feature in 10.3; not supported on-premises)

2 Machine-Specific Setup Instructions

2.1 Orchestration

Orchestration will have the following features set up on it:

  1. Mongo Database

  2. Python-based flask API

  3. Translation Worker

  4. Image Labeling Worker

  5. Transcription Finisher

  6. Archiving/Transcription Work Monitor

Below we describe the steps necessary to set up this Orchestration server:

  1. Unzip the RevealCloudAI.zip package at /opt/reveal/.

  2. Run the bash script located here:

    /opt/reveal/RevealCloudAI/Installation/orchestration_install.sh

    This script will perform the following:

    1. Install and Configure MongoDB

    2. Install Python3

    3. Install the AWS CLI

    4. Install several Python3 Libraries

    5. Install and Configure NGINX

    6. Install and Configure a Python uWSGI Virtual Environment

    7. Set up Reveal Config Files

    8. Install several systemd services

  3. Assuming the script ran correctly, you should have two config files here:

    /etc/Cloud_API.config

    /etc/cloud-ai-automation.config

    Edit both of these config files and fix values as necessary.

  4. To test, you can run the following in the order below, making sure to double-check the status of every service after starting:

    systemctl start mongod

    systemctl start nginx

    systemctl start reveal_cloud_ai_api.service

    systemctl start image_labeling_worker.service

    systemctl start reveal_cloud_ai_monitor.service

    systemctl start transcription_worker.service

    systemctl start translation_worker.service
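
If preferred, the same start-and-verify sequence can be scripted. A minimal sketch (run as root):

    import subprocess

    SERVICES = [
        "mongod",
        "nginx",
        "reveal_cloud_ai_api.service",
        "image_labeling_worker.service",
        "reveal_cloud_ai_monitor.service",
        "transcription_worker.service",
        "translation_worker.service",
    ]

    for svc in SERVICES:
        # Start the service, then report its state; expect 'active' for each.
        subprocess.run(["systemctl", "start", svc], check=True)
        state = subprocess.run(
            ["systemctl", "is-active", svc], capture_output=True, text=True
        ).stdout.strip()
        print(f"{svc}: {state}")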

2.2 Transcription Master

Transcription Master will have the following features set up on it:

  1. FFMPEG/FFPROBE Install

  2. Transcription Master

Below we describe the steps necessary to set up this Transcription Master server:

  1. Unzip the RevealCloudAI.zip package at /opt/reveal/.

  2. Run the bash script located here:

    /opt/reveal/RevealCloudAI/Installation/transcription_install.sh

    This script will perform the following:

    1. Install Python3

    2. Install several Python3 Libraries

    3. Download FFMPEG and FFPROBE

    4. Set up Reveal Config Files

    5. Install the Transcription Master systemd service

  3. Assuming the script ran correctly, you should have a config file here:

    /etc/Cloud_API.config

    Edit this config file and fix values as necessary.

  4. To test startup, you can run the following:

    systemctl start transcription_master.service

3 AWS IAM Setup Items

In this section we describe the necessary items to set up within AWS.

3.1 KMS Policy

Replace the users here as necessary. The user 'revealdata-s3store-000000-fullaccess' represents the configured AWS credentials for Review, while 'svc-cloud-api-us' represents the configured user for the Cloud AI system.

This KMS key will be used by Review to encrypt data being sent to the Cloud AI system. It will also be used by the Cloud AI system when submitting Image Labeling and Transcription jobs. Translation jobs will use generic AWS server-side encryption, per the AWS limitation described in Section 1.9.2.

{ "Version": "2012-10-17", "Id": "key-consolepolicy-3", "Statement": [ { "Sid": "Enable IAM User Permissions", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::326122048023:root" }, "Action": "kms:*", "Resource": "*" }, { "Sid": "Allow use of the key", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::326122048023:user/revealdata-s3store-000000-fullaccess", "arn:aws:iam::326122048023:user/svc-cloud-api-us" ] }, "Action": [ "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*", "kms:GenerateDataKey*", "kms:DescribeKey", "kms:ListAliases" ], "Resource": "*" }, { "Sid": "Allow attachment of persistent resources", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::326122048023:user/revealdata-s3store-000000-fullaccess", "arn:aws:iam::326122048023:user/svc-cloud-api-us" ] }, "Action": [ "kms:CreateGrant", "kms:ListGrants", "kms:RevokeGrant" ], "Resource": "*", "Condition": { "Bool": { "kms:GrantIsForAWSResource": "true" } } } ] }

3.2 IAM User for Review (when using local storage)

Replace 'revealdata-analytics-temp-us' with your staging bucket name.

{ "Version": "2012-10-17", "Statement": [ { "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::revealdata-analytics-temp-us/*" ], "Effect": "Allow" }, { "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::revealdata-analytics-temp-us" ], "Effect": "Allow" } ] }

3.3 IAM User for Cloud AI

Replace 'revealdata-analytics-temp-us' with your staging bucket name, the same as for the IAM User for Review in Section 3.2. Replace 'arn:aws:iam::326122048023:role/CloudAI-AccessAnalyticsTempUS' with the ARN of a role that allows AWS services to access the analytics bucket; this role is described in Section 3.4.

{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "transcribe:*", "translate:*", "rekognition:*", "comprehend:*" ], "Resource": "*" }, { "Sid": "VisualEditor1", "Effect": "Allow", "Action": "iam:PassRole", "Resource": [ "arn:aws:iam::326122048023:role/CloudAI-AccessAnalyticsTempUS" ] }, { "Sid": "VisualEditor2", "Effect": "Allow", "Action": "s3:*", "Resource": [ "arn:aws:s3:::revealdata-analytics-temp-us/*", "arn:aws:s3:::revealdata-analytics-temp-us" ] } ] }

3.4 Role for AWS Services to access staging bucket

Replace 'revealdata-analytics-temp-us' with your staging bucket name.

{ "Version": "2012-10-17", "Statement": [ { "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::revealdata-analytics-temp-us/*" ], "Effect": "Allow" }, { "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::revealdata-analytics-temp-us" ], "Effect": "Allow" } ] }

Additionally, the role here must have the following Trust Relationship set:

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "comprehend.amazonaws.com", "transcribe.amazonaws.com", "translate.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] }

This trust relationship is set in the following location:

[Image: AWS_Services_Trust_Relationship_setting.png]
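
If console access is inconvenient, the trust policy can also be verified with boto3 (assumes credentials with IAM read access; the three service principals above should appear under the statement's Principal.Service):

    import boto3
    import json

    role = boto3.client("iam").get_role(RoleName="CloudAI-AccessAnalyticsTempUS")
    print(json.dumps(role["Role"]["AssumeRolePolicyDocument"], indent=2))
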
4 Testing

The most straightforward way to test is to use the scripts shipped with the package. The RevealCloudAI package contains a directory named 'testing' with a few scripts that facilitate testing the system:

RevealCloudAI/testing % ls

__init__.py  testing_labeling.py  testing_transcription.py  testing_translation.py

Generally speaking, each script can be run in the following manner:

python testing_transcription.py directory_with_audio -h 'http://1.2.3.4:5000/api/'

Each script should attempt to push documents to the Cloud AI system and wait for results to be returned.