Building and Refining a Custom Entity Type

Overview

Story Engine allows you to create custom entity types that are keyed to your data and adapted to your workflow. You build these entity types by annotating your data as part of an active review, using historical case material or other internal content. To feed the model-building process, you can identify exemplary text through pattern searches (regular expressions), lists of keywords or wildcards, or manual annotation of document text.

Once initiated, model building goes to work using your examples as a guide. You can refine the resulting entity model through further annotation, either by identifying additional relevant content or by identifying false positives.

You can publish the models used to generate these new entity types to the model library. The published model can be used to enrich a COSMIC workflow, and it can also be used in other storybooks. Through a succession of projects in which new data and new work product are applied, your bespoke entity models will embody accumulated expertise.

The Custom Entity Model Workflow explained below walks through the critical steps in creating a model-driven custom entity type for use in discovery. Complete details are provided in the current Reveal AI User Guide and Reveal AI Admin Guide, available from Reveal.

Custom Entity Model Workflow

Suppose a dispute involves a hospital's management of patient injuries, and you need to find all of the documents that mention body parts, in terms ranging from the formal to the colloquial.

Body parts is your entity type. A simple term search is likely to be incomplete (insufficient recall). Your list might not include the words alveoli or phalanges (both technical terms) or ticker (slang for heart). Results would also likely be noisy (insufficient precision). Carefully read the following sentence: "With an eye to putting a hand on the proper data, you might lose heart or not have the stomach to finger the correct information and nail the truth." The preceding sentence is not responsive and yet, depending on the search employed, might contain six false positives.
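
To see why, here is a quick, illustrative sketch (not Reveal query syntax): a naive word-boundary search for six common body-part terms matches every one of them in that sentence, even though none of them refers to an actual body part in that context.

    import re

    sentence = ("With an eye to putting a hand on the proper data, you might lose heart "
                "or not have the stomach to finger the correct information and nail the truth.")

    # A naive keyword search with no sense of context.
    naive = re.compile(r"\b(eye|hand|heart|stomach|finger|nail)\b", re.IGNORECASE)

    print(naive.findall(sentence))   # six hits, every one a false positive here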

So how do you develop a dynamic model that surpasses the limits of search? A typical approach traces a path through the elements of the accompanying flowchart:

  1. Design a custom search.

  2. Search and extract entities.

  3. Apply user annotations.

  4. Build an entity model.

  5. Assess recall, precision, F1 against targets.

  6. Run the entity model.

  7. Quality control review.

  8. If the model needs further development, then repeat steps 3 through 7.

  9. If the model is complete, then deliver results and/or publish.

Workflow details

  1. Design a custom search

    You might start by authoring a list of terms. You want these terms to be diverse in themselves and in the contexts in which they are likely to appear. For instance: ankle, capillary, iris. You might also augment your search with a list of terms that are technical and will always refer to body parts regardless of context.

    Your initial list does not have to be comprehensive or even large. A large list of words that are only relevant in specific contexts is ineffective because it initiates the process with too many false positives. Better to begin with a modest number of terms and, through iterations of user annotation, gradually teach the system to discern the proper contextual clues from which it will infer new entity examples.

    When considering terms composed of more than one word, distinguish between a.) an entity composed of several words and b.) a single-word entity accompanied by helpful context. For example, finger nail is itself a body part, so the two-word sequence is an entity. On the other hand, mangled finger is a body part preceded by a description of injury, so only the word finger is the entity. Your list can include finger and finger nail but not mangled finger.
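
    For illustration only, here is a minimal sketch of how such a seed list might be expressed as a single case-insensitive, word-boundary pattern. The term list and syntax below are assumptions for the example, not Reveal's query language, and the syntax accepted by your search environment may differ.

        import re

        # A hypothetical seed list drawn from the examples above: single-word
        # entities plus the multi-word entity "finger nail". Context markers
        # such as "mangled" are deliberately left out so the model can learn
        # them as context rather than as part of the entity.
        seed_terms = ["finger nail", "ankle", "capillary", "iris", "finger"]

        # Longer terms first, so "finger nail" matches before "finger" alone.
        pattern = re.compile(
            r"\b(" + "|".join(sorted(seed_terms, key=len, reverse=True)) + r")\b",
            re.IGNORECASE,
        )

        print(pattern.findall("She fractured a finger nail and twisted her ankle."))
        # ['finger nail', 'ankle']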

    image3.png
  2. Search and extract

    You perform a search based on terms or regular expressions (RegEx).

    image4.png

    Once the search completes you can see the hit report in Launchpad.

    image5.png

    You may manage entities through Launchpad, such as adding an individual entity to a search.

    image6.png

    You can search per entity type and then specify whether these entities were detected by a.) the term report (as would be applicable at this point in the workflow), b.) the entity model (see step 4 below), or c.) the user (see step 3 below), or any combination of the above.

    image7.png
  3. Apply user annotations

    You review these documents focusing on the hit highlighting. Highlighted terms resulting from a search are presented as candidates waiting for you to vote VALID or INVALID. Either of these designations will help model building. If a term is judged not to be helpful in either respect, it may be trashed.

    You must also review for other entity instances and annotate them. Read the entire document where practical, but at an absolute minimum read the 30 words before and after each highlighted word. Any word you annotate in this process becomes a new anchor for reviewing 30 words before and after it. The system makes context-based decisions, so this method helps it discriminate among context signals and keeps you from inadvertently providing conflicting guidance. Do not guess at annotations – only code when certain. Care and thoroughness in this procedure are rewarded with the efficient creation of a reliable model.
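
    As an illustration of this expanding-window idea (a conceptual sketch only, not a description of Reveal's internals), the following shows how the set of word positions to review grows as each newly annotated word becomes an anchor:

        # Conceptual sketch: given the word positions of annotated terms in a
        # tokenized document, compute the word positions a reviewer should read.
        # Each new annotation becomes a new anchor and widens the region by
        # another +/- 30 words.
        WINDOW = 30

        def positions_to_review(anchors, doc_length):
            """Union of +/- WINDOW word positions around every anchor."""
            to_review = set()
            for anchor in anchors:
                start = max(0, anchor - WINDOW)
                stop = min(doc_length, anchor + WINDOW + 1)
                to_review.update(range(start, stop))
            return to_review

        doc_length = 500                    # words in the document
        anchors = [120]                     # initial search hit at word 120
        print(len(positions_to_review(anchors, doc_length)))   # 61 words to read

        anchors.append(200)                 # a new entity annotated during review
        print(len(positions_to_review(anchors, doc_length)))   # 122 words to read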

    This chart gives examples of mistaken and correct annotations:

    Mistaken annotation | Correct annotation | Reason
    Annotating only part of the phrase "finger nail" | Annotating the full phrase "finger nail" | The phrase "finger nail" is a specific body part.
    Annotating "mangled finger" as the entity | Annotating only "finger" | The word "mangled" describes the state of the finger. It could apply to many other body parts and is therefore a helpful context marker.
    Annotating only part of the phrase "small intestine" | Annotating the full phrase "small intestine" | The "small intestine" is a specific body part. The word "small" is not a good context marker.
    Annotating "inflamed eye" as the entity | Annotating only "eye" | The word "inflamed" describes the state of the eye. It could apply to many other body parts and is therefore a helpful context marker.
    Annotating "left lung" as the entity | Annotating only "lung" | The word "left" describes the position of the lung. It could apply to many other body parts and may therefore be a helpful context marker.

    As mentioned above in discussing the design of a custom search, be judicious in designating a multiple-word sequence as an entity. Highlight the entire word sequence when those words together constitute the entity (such as small intestine). Contrast this with a single word identifying the entity accompanied by modifiers (such as inflamed eye); in this latter case, highlight only the word eye. This matters because the model needs to accumulate knowledge of helpful contextual terms (such as inflamed, which could modify many body parts), so avoid absorbing such contextual terms into the entity examples.

    The following is a detailed description of the user annotation method:

    For custom entities only, you may tell the system whether an entity is valid or invalid. By default, entities are assigned neither "valid (+)" nor "invalid (-)"; they are in a "candidate" state.

    When an entity has been identified incorrectly by a custom entity model, or has been picked up by an entity search and extract query but is NOT actually an instance of the entity type, you may annotate it as INVALID. Correctly annotating such false positives can rapidly improve model accuracy.

    For example, in the instance shown below, "arm" is misidentified by an entity search & extract report as a "Body Part":

    image8.png

    By default, since it was detected by an entity search & extract report, it is not assigned valid (+) or invalid (-).

    You could simply remove the annotation by choosing the trash can icon on the right. However, to help refine the model, you may instead click the negative sign:

    image9.png

    This tells the system that this particular use of the term is not relevant. The highlighted word now appears with a strikethrough.

    Alternatively, you may decide that "heart" is a VALID example that you want the system to learn from. In this case, press the "+":

    image10.png
  4. Build an entity model

    Now that you have searched and annotated, it is time to build a dynamic model from the annotations. The first step is to create a model based on the entity type "Body Parts".

    image12.png

    You must build a model before you run a model.

    Build

    1. All examples: include ALL annotations associated with the entity (search & extract results, previous model results, and user highlights); or

    2. Examples annotated by user: include user highlights only.

      image13.png

    When you build the model, and before you have run the model, you are presented with current recall, precision and F1 numbers. (You must run the model in order to view the latest highlighted entity candidates. See Step 6 below.)

  5. Assess recall, precision and F1 against goals

    Review the information retrieval measurements. Recall measures the extent to which you have captured all of the intended entity examples. Precision measures the extent to which you have avoided false positives. F1 is the harmonic mean of recall and precision. At the margin there is some trade-off between recall and precision, and which takes priority will vary by project. For example, if the custom entity is being used to identify privileged or confidential information, the paramount goal is likely recall.
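
    For reference, these measures follow the standard definitions computed from true positive, false positive, and false negative counts; a minimal sketch:

        def precision_recall_f1(tp, fp, fn):
            """Standard information retrieval measures from raw counts."""
            precision = tp / (tp + fp)    # share of flagged entities that are correct
            recall = tp / (tp + fn)       # share of true entities that were found
            f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
            return precision, recall, f1

        # Example: 80 correct hits, 20 false positives, 10 missed entities
        print(precision_recall_f1(80, 20, 10))   # (0.8, 0.888..., 0.842...)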

    Information retrieval measurements become more reliable as you iterate through steps 3 through 7. You typically want to run through several rounds of annotation, rebuilding, running, and measuring before you fully rely on these numbers. When these statistics match or exceed your targets, you might decide to publish the model to the library. A more typical sequence, however, is to run the model and engage in Quality Control review, taking into account both the inspection of any newly highlighted entity candidates and the information retrieval statistics.

    image14.png
  6. Run the entity model

    Running the model generates output with the new entities highlighted. If you are satisfied with the existing information retrieval numbers, you may deliver results for legal review. More typically, you proceed to Quality Control review.

  7. Quality Control review

    In this step you determine the need for any corrections or additions by way of user annotations as described in step 3 above (view a roll-up of entity hits in Launchpad, and/or review entity hits in documents on a one-by-one basis in the thread viewer). You will typically take into account both the results of your document inspection and the state of your information retrieval measurements. See Step 5 above. If additional annotation is required, go to step 8. If you are completely satisfied, go to step 9.

  8. If the model needs further development, then repeat steps 3 through 7

    When engaging in additional rounds of user annotations, as before, you want to review the resulting hits and, at a minimum, their context windows (30 words before and after). Once again, the 30 words before and after any new entity examples (that you discover during the review) should also be reviewed. Ideally, you review the entire document for missed entities.

  9. If the model is complete, then deliver results and/or publish.

    Once you are satisfied with recall, precision and the results of manual review, you have obtained a highly qualified set of documents containing designated entities. You may deliver the annotated documents (for example, for legal review) and/or publish the model to the library.

    When you publish the model to the library, it becomes available for use in COSMIC for advanced predictive coding. Adding a properly weighted custom entity to a COSMIC workflow can enhance COSMIC performance. Published models are also available for use in other storybooks.