Reveal Review Publication

Reveal AI Threading Technology

  • Email data preprocessing steps in the Story Engine™ include email segmentation. This is the process of identifying individual messages in the email document. Often email documents contain the most recently sent message together with the replies or forwarded messages that are included below. The process of email segmentation identifies each message within email documents; individual messages are called segments in the Story Engine™.

  • Story Engine™ models automatically detect the six components of an email message (segment):

    • Header

    • Greeting

    • Body

    • Signature

    • Disclaimers (footer)

    • Advertisement

  • Story Engine™ performs the analysis of the segment header, including extracting individual communicator information from the email header fields (from, to, cc, bcc), extracting the sent date and normalizing it using the time zone information when available, and extracting the subject line. Communicator names are normalized using the Story Engine™ name normalization process so that alternative email addresses, labels and nicknames for a communicator are all resolved to that entity.
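
The header analysis described above can be illustrated with a minimal Python sketch. It is not the Story Engine™ implementation: the `ALIASES` table, the `resolve` and `parse_header` names, and the fallback to the raw address are all assumptions made for illustration. Only the standard library `email.utils` helpers are real APIs.

```python
from datetime import timezone
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical alias table: each known address maps to one communicator entity.
ALIASES = {
    "alice@example.com": "Alice Smith",
    "asmith@example.org": "Alice Smith",
}

def resolve(address):
    """Resolve an address to an entity, falling back to the lowercased address."""
    key = address.lower()
    return ALIASES.get(key, key)

def parse_header(fields):
    """Extract communicators, sent date, and subject from header fields."""
    people = {}
    for key in ("from", "to", "cc", "bcc"):
        pairs = getaddresses([fields.get(key, "")])
        # Skip empty entries, which getaddresses may return for a blank field.
        people[key] = [resolve(addr) for _name, addr in pairs if addr]
    sent = parsedate_to_datetime(fields["date"])
    if sent.tzinfo is not None:
        # Normalize to UTC only when time zone information is available.
        sent = sent.astimezone(timezone.utc)
    return {"people": people, "sent": sent, "subject": fields.get("subject", "")}
```

With this sketch, two different addresses for the same person resolve to one entity, and a sent date of `10:00:00 -0500` is stored as 15:00 UTC.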

  • Email documents that share segments with one or more additional email documents are organized into a Thread group.

  • The MD5 hash for each segment is computed on the following components of the segment message: body + subject line + greeting + signature. 
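
As a rough illustration of the segment hash, the sketch below computes an MD5 digest over the four listed components in the stated order. The function name `segment_md5`, the newline separator, the UTF-8 encoding, and the whitespace collapsing are assumptions; the source does not say how the components are joined or cleaned before hashing.

```python
import hashlib

def segment_md5(body, subject, greeting, signature):
    """MD5 over body + subject line + greeting + signature, in that order."""
    parts = (body, subject, greeting, signature)
    # Assumption: collapse runs of whitespace so layout-only differences
    # do not change the digest.
    joined = "\n".join(" ".join(part.split()) for part in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```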

  • An additional fingerprint signature hash is computed on the following components of the segment message: body + subject line + greeting + signature. 

    • Story Engine™ uses an algorithm to detect email segments whose bodies and header lines have small text differences but are still duplicates. Email body text may vary from email to email because of different email client applications. For example, an email sent from MS Outlook to Gmail may have different body formatting and additional text characters.

    • Segments from different email documents are deemed to be the same if the following conditions are met:

      • MD5 hash of the content is the same.

      • Sender name after name resolution is the same.

      • Subject line is the same. Subject lines are normalized to remove prefixes such as "Fwd:" and "RE:".

      • Date/time matches. An approximate match is allowed: the two values may be within 2 hours of each other, to account for server time drift.

    • If the segments are deemed the same, the documents which contain those segments are assigned to the same email thread. Further analysis of the other segments in those documents is performed to check if the documents have other segments in common.

      • In this step the Story Engine™ threading algorithm allows near-duplicate matches of the body content: instead of requiring a match on the MD5 hash, it accepts a match on the fingerprint hash signature. A fingerprint typically has 5 parts.

      • For shorter segments (segment body length < 200 characters), all 5 fingerprint parts must match for the segments to be identified as near duplicates.

      • For longer segments, the Story Engine™ threading algorithm considers segments to be near duplicates if at least 4 fingerprint parts match.

      • The other criteria (sender, subject line, and date) must match exactly.
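
The two matching rules above can be sketched as follows. This is an illustrative reading of the criteria, not the product's code. The `Segment` class and its fields are hypothetical, fingerprint parts are assumed to be compared position by position, and when two segments differ in length the shorter body length is assumed to decide the threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Segment:
    md5: str
    fingerprint: tuple   # assumed: five comparable parts
    sender: str          # after name resolution
    subject: str         # after reply/forward prefixes are removed
    sent: datetime
    body_length: int

def same_segment(a, b, tolerance=timedelta(hours=2)):
    """Initial match: MD5, sender, and subject equal; dates within 2 hours."""
    return (a.md5 == b.md5 and a.sender == b.sender
            and a.subject == b.subject and abs(a.sent - b.sent) <= tolerance)

def near_duplicate(a, b):
    """Follow-up match: fingerprint parts replace the MD5 requirement."""
    matching = sum(x == y for x, y in zip(a.fingerprint, b.fingerprint))
    # Short segments (< 200 characters) need all 5 parts; longer ones need 4.
    required = 5 if min(a.body_length, b.body_length) < 200 else 4
    return (matching >= required and a.sender == b.sender
            and a.subject == b.subject and a.sent == b.sent)
```

Both datetimes passed to `same_segment` should be either time zone aware or naive; Python raises `TypeError` when the two kinds are subtracted.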

  • Story Engine™ detects one or more Inclusive emails in a Thread group. An Inclusive email contains all text portions from the Thread group. A Thread group can have more than one Inclusive email. For example, forwarding an email while also continuing the original conversation produces two parallel email conversations, each with its own unique emails.

    • Included emails have all of their email bodies contained in an Inclusive email and therefore are redundant information.
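
One way to picture the Inclusive and Included distinction is as a subset test over segments. The sketch below is a simplification under stated assumptions: each document is reduced to the set of segment hashes it contains, attachments are ignored (so "Inclusive Partial" is not modeled), and when two documents hold identical segments the one with the lower document ID is kept as Inclusive. The function name `classify_thread` is hypothetical.

```python
def classify_thread(docs):
    """docs maps a document ID to the set of segment hashes it contains."""
    labels = {}
    for doc_id, segments in docs.items():
        covered = any(
            # Fully contained in a larger document, or an exact copy of a
            # document that sorts earlier (assumed tie-break).
            segments < other or (segments == other and other_id < doc_id)
            for other_id, other in docs.items()
            if other_id != doc_id
        )
        labels[doc_id] = "Included" if covered else "Inclusive"
    return labels
```

In a thread where one branch continues the conversation and another forwards it, both branch ends hold segments the other lacks, so this sketch labels both as Inclusive.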

Reveal AI Story Engine – Thread Intelligence

  • Threading Fields Explained:

Field Name

Relativity® Field Type

Details

NexLP_TheadItemType

Multiple Choice

One of the following choices:

1. Inclusive, the most inclusive email in the email chain which is not missing any attachments from included emails. There may be more than one inclusive email per thread.

2. Inclusive Partial, the most inclusive email in the email chain, but with one or more attachments missing.

3. Included, email content is completely included in another email which is identified as "Inclusive" or "Inclusive Partial".

4. Attachment: email attachment. If a thread has multiple copies of the same attachment, only one copy is defined as “Attachment”. A copy attached to an inclusive document is preferred if one exists.

5. Attachment_Secondary: Duplicate attachments in the same thread, by MD5 hash, attached to a different parent.

6. Attachment_Duplicate: All other attachments.

NexLP_ThreadID

Relational, fixed length

The unique Thread Id of the email chain this email or attachment belongs to.

NexLP_InclusiveDocId

Relational, text

The DocID of the "Inclusive" or "Inclusive Partial" email that this email or attachment refers to. Populated on both emails and attachments.

NexLP_InclusiveEmailId

Relational, fixed length

The DocID of the "Inclusive" or "Inclusive Partial" email this email refers to. Not populated on attachments.

NexLP_AttachmentCount

Whole number

Number of attachments this email has. 0 if it is an attachment.

NexLP_SortOrder

Whole number

A whole number representing the sort order Reveal AI uses. The "Inclusive" or "Inclusive Partial" email is assigned the lowest number (starting from 1), followed by its included emails and attachments. If there are multiple "Inclusive" or "Inclusive Partial" emails, they are sorted by the minimum DateSent found among their segments. If two "Inclusive" documents have the same minimum DateSent, the one with more segments gets the lower number.

NexLP_IndentLevel

Whole number

A whole number used for display purposes: 0 for an "Inclusive" or "Inclusive Partial" email, 1 for an "Included" email, 2 for attachments of an "Inclusive" or "Inclusive Partial" email, 3 for attachments of "Included" emails, and 5 for unknown.
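
The NexLP_SortOrder rules can be sketched in Python as shown below. This is an illustration under assumptions, not the Reveal AI implementation: the function name `assign_sort_order` and both input shapes are hypothetical, the numbering is shown for a single thread, and the relative order of included emails and attachments under one inclusive email is taken as given because the field description does not specify it.

```python
from datetime import datetime

def assign_sort_order(segment_dates, children):
    """Number the documents of one thread, starting from 1.

    segment_dates: inclusive doc ID -> list of DateSent values of its segments.
    children: inclusive doc ID -> its included emails and attachments, in order.
    """
    # Earliest minimum DateSent first; on a tie, more segments sorts first.
    ranked = sorted(
        segment_dates,
        key=lambda doc: (min(segment_dates[doc]), -len(segment_dates[doc])),
    )
    order = {}
    counter = 1
    for inclusive in ranked:
        for doc_id in [inclusive, *children.get(inclusive, [])]:
            order[doc_id] = counter
            counter += 1
    return order
```

Each inclusive email is numbered first and is immediately followed by the documents listed under it, so a view sorted on this value keeps every inclusive email together with its included emails and attachments.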