Clustering and Cluster Scores

Clustering analyzes textual documents and gives users insight into different parts of the data set by grouping documents that have similar content. Clusters are generated automatically in Reveal AI.

Clustered documents may relate to a subject or to a type of communication. Within a cluster, documents closer to the center receive a higher cluster score. For example, in an Earnings Call cluster, a report or transcript of such a call would score higher than preparatory discussions about it.

Reveal AI uses only the document content to compute the clusters. Emails are clustered by thread: all emails from the same thread form a single unit for clustering, and attachments are not treated as part of the thread. Each copy in the thread then receives the same cluster assignment as the thread. Attachments and any other files are clustered individually, independently of their parent emails. An attachment can be part of any cluster, although most files suspected to be spreadsheets are placed in the special cluster "Assumed summaries and reports".
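The unit-of-clustering rule described above can be sketched in a few lines of Python. This is only an illustration, not Reveal AI's implementation: the record fields (`id`, `kind`, `thread_id`) and both function names are hypothetical, and the cluster labels are supplied by hand rather than computed.

```python
from collections import defaultdict


def build_clustering_units(documents):
    """Group emails by thread; every attachment or other file is its own unit.

    `documents` is a list of dicts with hypothetical keys:
    "id", "kind" ("email" or anything else), and "thread_id".
    """
    threads = defaultdict(list)
    units = []
    for doc in documents:
        if doc["kind"] == "email":
            # Emails from the same thread are clustered together as one unit.
            threads[doc["thread_id"]].append(doc["id"])
        else:
            # Attachments and other files are clustered individually.
            units.append([doc["id"]])
    units.extend(threads.values())
    return units


def propagate_assignments(units, unit_clusters):
    """Give every document in a unit the cluster assigned to that unit."""
    return {
        doc_id: cluster
        for unit, cluster in zip(units, unit_clusters)
        for doc_id in unit
    }


documents = [
    {"id": "e1", "kind": "email", "thread_id": "t1"},
    {"id": "a1", "kind": "attachment", "thread_id": "t1"},
    {"id": "e2", "kind": "email", "thread_id": "t1"},
]
units = build_clustering_units(documents)
# Labels are made up here; a real clustering step would produce them.
assignments = propagate_assignments(units, ["cluster A", "cluster B"])
```

In this sketch the attachment `a1` ends up in its own unit even though it belongs to thread `t1`, while `e1` and `e2` share whatever cluster their thread receives.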

Each copy that has a cluster assignment also has a cluster score. A higher score indicates that the document is closer to the center of the cluster. The cluster score is typically between 4 and 5.
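To make the idea of "closer to the center" concrete, here is a minimal sketch that ranks documents by cosine similarity to the mean vector of their cluster. It rests on several assumptions: the document vectors are made up, the function names are hypothetical, and the similarity measure is a stand-in, since the formula and scale Reveal AI actually uses are not described here. The values it produces are therefore not on the same scale as the product's cluster score; only the ordering is meant to be illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of the document vectors (the cluster center)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closeness_scores(vectors):
    """Higher value means the document lies closer to the cluster center."""
    center = centroid(vectors)
    return [cosine_similarity(v, center) for v in vectors]


# Hypothetical two-dimensional vectors for three documents in one cluster.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = closeness_scores(vectors)
# The third document points away from the others, so it scores lowest.
```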

The cluster name shows the most frequent topics, that is, the topics most representative of the content of the documents in the cluster. Cluster names are computed for each cluster in the hierarchy.

If there is no most frequent topic, or if no topics were extracted for most of the documents in the cluster, the system uses the name “Incoherent cluster”.
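The naming rule, including the fallback name, might look roughly like the following sketch. The thresholds are assumptions made for illustration: "most of the documents" is read as at least half, and "no most frequent topic" is read as no topic being shared by two or more documents. The function name `name_cluster` and the `top_n` parameter are hypothetical, and topic extraction itself is out of scope.

```python
from collections import Counter


def name_cluster(doc_topics, top_n=3):
    """Build a cluster name from per-document topic lists.

    `doc_topics` holds one list of topic strings per document;
    an empty list means no topics were extracted for that document.
    """
    docs_with_topics = [topics for topics in doc_topics if topics]
    # Assumption: fall back when at least half of the documents have no topics.
    if len(docs_with_topics) * 2 <= len(doc_topics):
        return "Incoherent cluster"

    # Count each topic once per document, keeping first-seen order for ties.
    counts = Counter(
        topic
        for topics in docs_with_topics
        for topic in dict.fromkeys(topics)
    )
    ranked = counts.most_common()
    # Assumption: a topic seen in only one document is not "most frequent".
    if ranked[0][1] < 2:
        return "Incoherent cluster"
    return ", ".join(topic for topic, _ in ranked[:top_n])


name = name_cluster([
    ["earnings", "revenue"],
    ["earnings", "guidance"],
    ["earnings", "revenue"],
])
# name == "earnings, revenue, guidance"
```

Counting each topic once per document keeps a single long document from dominating the name, which is one reasonable design choice among several.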

In addition, the system creates special clusters for documents that cannot be clustered in a meaningful way, for example, empty documents or documents that are very short.