Image processing in Vantiq allows applications to make use of images as another type of sensor input. Just as various sensors on equipment can report their temperature or other facts that let an application monitor their ongoing state, images can be used to gather information where physical state sensors are not appropriate or available.

Consider a town attempting to determine if some area (walkway, street, etc.) is occupied. While it may be possible to instrument every sidewalk with pressure sensors every few inches, it is much more efficient to have cameras provide periodic images of the areas in question. Also, consider a factory setting where there may be security or worker health considerations for people in designated areas. Information provided by cameras in these areas can be used to determine if and what situations may exist and who should be notified.

Image information, however, does not directly provide the facts that are needed by an application. Instead, the images need to be analyzed to produce the information relevant to the application.

Generally speaking, this analysis is performed by a neural net. A neural net (or, more formally, neural network) is a set of algorithms modeled loosely on the human brain. Neural nets interpret data by recognizing patterns, and a particular neural net model is trained to interpret those patterns for a particular purpose, such as object recognition or face recognition. The neural net model is, effectively, part of the application.

To build an application utilizing image processing, there are some general capabilities that are necessary. The application needs to be able to acquire, manipulate, and analyze images. The remainder of this document describes these capabilities within the Vantiq system.

Overview

The Vantiq system has a set of resources and services that are relevant to this area. The following sections describe these capabilities in more detail.

Image Acquisition

Image acquisition in the Vantiq system may be accomplished in a variety of ways. External programs can load images as described in the Images section of the Resource Guide.

Vantiq also offers an Enterprise Connector called the Object Recognition Connector that can be used to monitor a camera, fetch the image, and supply it to the system.

Image Manipulation

As images appear in the system, it is often desirable to manipulate them. They may need to be resized, converted to black & white, or labeled, or entities within the image may need to be identified. To perform these operations, the Vantiq system uses the VisionScriptOperation service.

Vision Script

A Vision Script consists of a series of actions that are performed on an image. The available actions (convert to grayscale, crop, describe, draw boxes, draw text, find faces, resize, and save) are described in the sections that follow.

The vision script, via the VisionScriptOperation service, processes a specific image. The image in question becomes the working image, and subsequent actions can get information about the current working image or change it. Each action is applied to the image, and the results of that action are available either as the working image or as a result object.

The object returned from the service has a property for each action that provides a result, where the property name is the action’s tag property or, if no tag is provided, the action’s name.

Vision Script Structure

Vision scripts in VAIL can be constructed using the VisionScriptBuilder service. Alternatively, they can be constructed directly. A Vision Script object consists of the following properties.

  • scriptName – Optional name of the script. Primarily used for debugging.
  • script – Array of actions that comprise the script

Each action consists of the following properties.

  • name – String that identifies the action to be performed. The actions and their names are described in more detail in the subsequent sections.
  • tag – Optional String containing an identity for the action. This is used to reference that action’s results. If the tag is not provided, the action’s results can be identified by its name.
  • parameters – Object containing the specific parameters for this particular action. The parameters for each action are described with the individual action.
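
As a brief sketch, a script of this form might be constructed directly in VAIL as follows (the actions used here are described in the sections below):

    // A minimal Vision Script object, constructed directly
    var script = { scriptName: "ExampleScript" }
    // An action with no parameters
    var gsAction = { name: "convertToGrayscale" }
    // A tagged action; its results will appear under the tag name
    var describeAction = { name: "describe", tag: "imageInfo" }
    script.script = [gsAction, describeAction]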

Convert to Grayscale Action

This action converts an image to grayscale or black & white.

  • name – convertToGrayscale
  • parameters – None

The result of this action is that the working image is converted to grayscale.

Crop Action

This action replaces the current working image with the part of the working image identified by the parameters provided. The portion of the image to which to crop is identified by the upper left corner, and a height & width.

  • name – crop
  • parameters
    • x – Integer x coordinate of the top, left corner of the portion to which to crop
    • y – Integer y coordinate of the top, left corner of the portion to which to crop
    • width – Integer width of the area to which to crop.
    • height – Integer height of the area to which to crop

The result of this action is that the working image is replaced by the cropped portion.
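
For example, a crop action that keeps a 200 x 100 region whose upper left corner is at (50, 40) might look like this (illustrative values):

    // Sketch: crop to the region starting at (50, 40), 200 wide by 100 high
    var cropAction = { name: "crop",
                       parameters: { x: 50, y: 40, width: 200, height: 100 } }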

Describe Action

This action provides information about the current working image. The information provided includes whether the image is empty, the size (height & width) of the image, and the number of channels (3 or more for color, 1 for grayscale).

  • name – describe
  • parameters – None

The result of this action is an entry in the results with the name of the tag (if present) or the name describe containing the data outlined above.

Draw Boxes Action

This action draws boxes on the working image. This is often used to call attention to objects identified by the analysis.

  • name – drawBoxes
  • parameters
    • boxList – List of boxes to draw on the working image. Each box is described as follows.
      • x – Integer X coordinate of upper left corner of box
      • y – Integer Y coordinate of upper left corner of box
      • width – Integer width of box, measured (right) from X coordinate
      • height – Integer height of box, measured (down) from Y coordinate
      • thickness – Optional integer thickness of the box boundary. If not present, defaults to 2.
      • color – Optional object containing 3 Integers (values 0-255): red, green, and blue. For example, { red: 128, green: 128, blue: 128 }. If not present, defaults to red.
      • label – Optional String with which to label the box
      • font – Optional String identifying the font to use (see Draw Text Action for details)
      • isItalic – Optional Boolean indicating whether to italicize text (if possible).

The result of this action is that the working image is replaced by one with the indicated boxes added.
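
For example, an action drawing a single labeled box might look like the following (illustrative values; thickness and color could be omitted to accept the defaults):

    // Sketch: draw one 100 x 50 green box labeled "car"
    var drawBoxesAction = { name: "drawBoxes",
                            parameters: { boxList: [ { x: 20, y: 30, width: 100, height: 50,
                                                       thickness: 3,
                                                       color: { red: 0, green: 255, blue: 0 },
                                                       label: "car" } ] } }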

Draw Boxes Using Previous Results Action

This action draws boxes on the working image, obtaining the list of boxes from a previous action. This is often used to call attention to objects identified by the analysis.

  • name – drawBoxes
  • parameters
    • useResultsFrom – String name of the previous action from which to obtain the box list. The name used here is either the tag from the previous action, or, if no tag was used, the name of the previous action.

The result of this action is that the working image is replaced by one with the indicated boxes added.

Draw Text Action

This action puts a text message on the working image.

  • name – drawText
  • parameters
    • x – Integer x coordinate for start of text
    • y – Integer y coordinate for start of text
    • font – String naming the font to use for the text. Font choices are controlled by the underlying platform, and are as follows:
      • HERSHEY_PLAIN
      • HERSHEY_COMPLEX
      • HERSHEY_TRIPLEX (Italics ignored)
      • HERSHEY_SIMPLEX
      • HERSHEY_DUPLEX
      • HERSHEY_COMPLEX_SMALL
      • HERSHEY_SCRIPT_SIMPLEX
      • HERSHEY_SCRIPT_COMPLEX
    • isItalic – Boolean indicating whether to italicize the text (if possible)
    • thickness – Integer thickness in pixels of the text
    • fontScale – Real number by which to scale the text
    • color – Object containing 3 Integers (values 0-255): red, green, and blue. For example, { red: 128, green: 128, blue: 128 }. If not present, defaults to red ({red: 255, green: 0, blue: 0}).

The result of this action is that the working image is replaced by one with the indicated text added.
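
For example, a drawText action placing a white label near the top of the image might look like this (illustrative values):

    // Sketch: draw "Camera 3" starting at (10, 30) using a plain Hershey font
    var drawTextAction = { name: "drawText",
                           parameters: { x: 10, y: 30, font: "HERSHEY_PLAIN",
                                         isItalic: false, thickness: 2, fontScale: 1.5,
                                         color: { red: 255, green: 255, blue: 255 } } }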

Find Faces Action

This action locates faces in an image.

  • name – findFaces
  • parameters – None

The results of this action are placed in the service result, identified by the tag, or name if no tag was provided. For each face found, the action result will contain an array entry describing a box outlining the face found. The box description consists of the x and y coordinates (top, left corner), and height and width.

Resize Action

This action replaces the current working image with the same image downsized to the size requested.

  • name – resize
  • parameters
    • width – Integer width to which to resize. Width must be less than or equal to the width of the current working image.
    • height – Integer height to which to resize. Height must be less than or equal to the height of the current working image.

The result of this action is that the working image is replaced by the resized image.
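
For example, a resize action reducing the working image to 320 x 240 (neither dimension may exceed that of the current working image) might look like this:

    // Sketch: downsize the working image to 320 x 240
    var resizeAction = { name: "resize",
                         parameters: { width: 320, height: 240 } }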

Save Action

This action saves the current working image as an instance of the Images resource.

  • name – save
  • parameters
    • saveName – Optional String providing the name to which to save the image. If not provided, the image on which this script is operating will be overwritten.
    • fileType – Optional String providing the MIME type to be used (e.g. image/png, etc.). If not provided, the fileType of the image on which this script is operating will be used.

The result of this action is that the current working image is written as an instance of Images. A script that does not include a save action will result in no changes being made to the image. The result object returned will also contain the image saved (identified by the tag if provided, or action name if not). The returned object will include the name and fileType of the image saved.

Using Vision Script

As an example, we will create a procedure that runs a VisionScript. This script will

  • convert the image to grayscale,
  • save that image,
  • find faces in that image,
  • draw boxes around the faces, and
  • save the image with boxes

For our example, we will run these actions over this image.

Marty and Partner

The procedure is as follows.

PROCEDURE vsExample(imageName String)

    // Create the script
    var script = { scriptName: "FindingFaces"}
    // Build a convertToGrayscale action
    var convertAction = { name: "convertToGrayscale"}
    // Save that image to an image named gsSave.jpg
    var saveGrayScaleAction = { name: "save"}
    var saveGSParams = { saveName: "gsSave.jpg", fileType: "image/jpeg"}
    saveGrayScaleAction.parameters = saveGSParams

    // Find the faces in the image & draw boxes
    var ffAction = { name: "findFaces", tag: "locateFaces" }

    var drawBoxesAction = { name: "drawBoxes"}

    // Use results from the findFaces action to draw boxes on our image
    var dbParams = { useResultsFrom: "locateFaces"}
    drawBoxesAction.parameters = dbParams

    // Save the resulting image to boxedSave.jpg
    var saveBoxedAction = {name: "save", tag: "boxedSave"}
    var saveBoxedParams = { saveName: "boxedSave.jpg", fileType: "image/jpeg"}
    saveBoxedAction.parameters = saveBoxedParams

    script.script = [convertAction,
                    saveGrayScaleAction,
                    ffAction,
                    drawBoxesAction, 
                    saveBoxedAction]

    var result = VisionScriptOperation.processImage(imageName, script)
    return result

When we run this procedure, we will see results that look approximately like this.

{
   "locateFaces": [
      {
         "height": 57,
         "width": 57,
         "x": 224,
         "y": 16
      },
      {
         "height": 60,
         "width": 60,
         "x": 75,
         "y": 47
      }
   ],
   "save": {
      "name": "gsSave.jpg",
      "fileType": "image/jpeg",
      "contentSize": 77543,
      "ars_modifiedAt": "2019-07-15T20:47:44.189Z",
      "ars_modifiedBy": "fhc",
      "content": "/pics/gsSave.jpg"
   },
   "boxedSave": {
      "name": "boxedSave.jpg",
      "fileType": "image/jpeg",
      "contentSize": 78369,
      "ars_modifiedAt": "2019-07-15T20:47:44.203Z",
      "ars_modifiedBy": "fhc",
      "content": "/pics/boxedSave.jpg"
   }
}

There are a few things to note about the results here. First, we see the results of the findFaces action. Since we provided a tag for that action, the results are identified by the tag value. Note that the vision script identified the previous action results for the drawBoxes action using the tag value locateFaces.

This vision script object has two (2) save actions in it, one with a tag and one without. We see the results of both actions – one identified by the tag value boxedSave, and the other identified by its action name save, since no tag was provided. If no tag were provided for either save action, the results would contain only the results from the last save action. This is because only one result with a given name can be returned.

The two saved images produced are the simple grayscale version

Grayscale Image

and the image with faces marked.

Faces identified

In this case, since the image is grayscale, the color of the boxes is black (rather than the default color of red).

Image Analysis

The Vantiq system provides the means to employ specific neural net models to perform analysis of complex data as part of an application. Specifically, the Vantiq system provides the ability to run TensorFlow models. TensorFlow models can be used to process data that is available to the Vantiq system.

TensorFlow models must be provided to the system, as described here.

YOLO-based TensorFlow Models

For detection of objects within an image, Vantiq provides support for YOLO-based TensorFlow models. These are designed to efficiently process an image, identifying entities within that image based on the model’s training.

YOLO-based TensorFlow models are identified by the modelType of tensorflow/yolo. (See the Resource Reference Guide for more information.)

Vantiq uses a TensorFlow implementation of the YOLO (You Only Look Once) type of neural net. As the name suggests, this style of operation scans the image, looking at each area only once. Currently, Vantiq supports YOLO Version 2 and Version 3 models. YOLO Version 3 is reported to be more accurate, particularly for smaller objects, but we make no recommendation as to your choice of version.

Objects that are identified by the model are returned with the following information:

  • label – the type of object identified
  • confidence – specifying on a scale of 0-1 how confident the neural net is that the identification is accurate
  • location – containing the coordinates for the top, left, bottom, and right edges of the bounding box for the object

Preparation of YOLO Models

Construction or acquisition of these models must be done outside of Vantiq, and is beyond the scope of this document. Information about model construction and translation to TensorFlow can be found at the following locations:

  • darknet – Information about building the model
  • darkflow – Translation from YOLO v2 to TensorFlow format.

Unfortunately, translation of YOLO v3 models to TensorFlow is not as straightforward. The darkflow system listed above does not yet support version 3. We have found the following mechanism(s) to provide reasonable results.

There will be other mechanisms that will work; this one is known to work in our system.

Note that the use of any of these mechanisms requires that the version of TensorFlow used to generate the model is compatible with (generally, less than or equal to) that used by Vantiq. The version used by Vantiq is available using the Resource.buildInfo() service on the system.tensorflowmodels resource type. This is described in the VAIL Rule and Procedure Reference Guide.

Describing the Model

As outlined here, a tensorflowmodel requires a model file (implemented via a ProtoBuf file or .pb file). This is a specification of the model graph used by TensorFlow.

Additionally, for YOLO models, we require a meta file (.meta file). The meta file contains data encoded as JSON, describing the training and interpretation of the model. While there is other information in the meta file, Vantiq makes use of the following.

  • net:
    • height and width – these contain the height & width of the expected input image. Current YOLO implementations require that these be identical values, evenly divisible by 32.
  • num – this is the number of anchor boxes used in the model.
    • A value of 5 indicates a YOLO version 2 model.
    • A value of 9 indicates a YOLO version 3 model.
    • If absent, it is calculated from the anchors property; if present, it is expected to match the number of anchor boxes provided.
  • anchors – the list of anchor boxes
    • A list of pairs of sizes representing the sizes of anchor boxes.
    • Anchor boxes are generated during model training and are used in determining each object’s bounding box.
    • As noted, anchors is a list of pairs, so the number of anchor boxes is half the size of this list.
  • labels – an ordered list of names of the objects found.
    • Running the model returns specific objects found only as a number; the labels property allows the objects found to be named with user-provided names rather than just numbers. The order of the labels in the list is important, as the labels list is referenced by the object index returned by the model to produce the object name.


At runtime, each model has a set of input and output operations. Generally, Vantiq can determine these from the model directly, but there may be cases where that fails. To provide for such cases, we support the following extension to the meta file.

  • vantiq – Vantiq extension used to provide information
    • inputOperations – a list of the names of the input operations
    • outputOperations – a list of the names of the output operations

Vantiq will provide information to and extract information from the model using these operation names, if provided. Otherwise, it will use the operation names determined directly from the model.
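
Pulling these properties together, a meta file for a YOLO version 2 model might look roughly like the following. The values shown are purely illustrative (actual values come from the model's training), and the vantiq section is needed only when the operation names cannot be determined from the model itself.

{
   "net": { "height": 416, "width": 416 },
   "num": 5,
   "anchors": [0.57, 0.68, 1.87, 2.06, 3.34, 5.47, 7.88, 3.53, 9.77, 9.17],
   "labels": ["person", "bicycle", "car"],
   "vantiq": {
      "inputOperations": ["input"],
      "outputOperations": ["output"]
   }
}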

Using YOLO Models

To analyze an image using such a model, use the TensorFlowOperation service. This is done as follows.

Assume that we have an image called targetImage.jpg, and a model, myModel with which to analyze it. Further, assume that we want objects identified only if the model’s confidence is at least 75%.
To perform the analysis, run the following VAIL code.

var yoloResults = TensorFlowOperation.processImage("targetImage.jpg", "myModel", 0.75)

Assuming that our image contained a car and a person, we might get a result back that looks like the following.

   {
     { confidence:0.79194605,
       location:[top:259.94803, left:622.9274, bottom:477.97113, right:897.4523],
       label:car
     },
     {
       confidence:0.8238598,
       location:[top:294.93753, left:342.35565, bottom:421.78534, right:404.92627],
       label:person
     }
   }
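
Since the result is a list of identified objects, it can be processed with ordinary VAIL constructs. For example, a brief sketch (assuming the result shape shown above):

    // Sketch: log each identified object and its confidence
    FOR (obj in yoloResults) {
        log.info("Found a {} with confidence {}", [obj.label, obj.confidence])
    }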

To analyze an image stored in a document, use the TensorFlowOperations.processDocument() procedure. The same style of results is returned.

For use in Apps, please see App tasks YOLO From Images and YOLO From Documents. The ConvertCoordinates task may also be of interest.

“Plain” TensorFlow Models

When presented with a YOLO-based TensorFlow model, Vantiq understands the context and organization of the model. As such, it can process the model output, returning data in a manner optimized for the model’s purpose. Running a YOLO version 3 model returns over 10,000 predictions; Vantiq understands the structure of YOLO model output and does the work to remove duplicates and predictions that fall below the required confidence.

In the more general case, Vantiq can run the model, but it cannot pre- or post-process the model’s output. Consequently, users of these models must be prepared to interpret the output of the model. Generally, these models may produce a large volume of output, so the application must be prepared to perform the appropriate analysis.

Moreover, the interaction with these models requires a deeper understanding of the input and output needs of the model in question. Developers using models of type tensorflow/plain are assumed to have an understanding of their model and TensorFlow in general.

Using These Models

At a high level, the running of a TensorFlow model involves the execution of a number of TensorFlow operations. These operations are used to analyze input and produce the results as output. Input and output are delivered to and from named operations through the use of tensors. (This should not be interpreted as a deep treatise on TensorFlow; we are merely providing enough terminology to understand the interface required.)

These tensors have a type and a value. The dimension of a tensor (whether it is a scalar or a multidimensional array) is determined at runtime, but must match the expectation of the model. A model’s users are expected to understand the input and output requirements for executing that model.

TensorFlow’s tensor types are more specific than the type system used by VAIL. The Vantiq runtime system will adapt accordingly, so callers need only be aware of the type compatibility. The set of TensorFlow tensor types includes FLOAT, DOUBLE, INT32, INT64, BOOL, and STRING. Generally, a STRING to TensorFlow is a byte array.

To provide data to TensorFlow, it is best to come as close as possible to the TensorFlow type. That is, to provide data to a FLOAT or DOUBLE tensor, it is best to use a VAIL REAL; to provide data to an INT32 or INT64 tensor, the use of an INTEGER is preferable. Any underlying number type will work, but there may be more work involved and less precision.

Similarly, values returned from TensorFlow will use the most appropriate VAIL types: FLOAT or DOUBLE to VAIL REAL, INT32 or INT64 to VAIL INTEGER. Objects passed into or returned from TensorFlow will be objects of the form

{ tensorType: <one of the tensor types above>, value: <VAIL value> }

These will be converted to or from tensors as required.

To analyze an image using such a TensorFlow model, use the TensorFlowOperation Service. The specific calls used to run tensorflow/plain models are described here.

Using the information about input and output tensors here, we can see that calls might be done as follows. (This example also appears in the TensorFlowOperation Service description.)

For a simple example, assume we wish to process an image named mycar.jpg using a model named identifyCars. Further assume identifyCars supports three (3) input tensors:

  • carPic – the image to analyze
  • year – (optional) the year in which the car was manufactured
  • country – (optional) the country of origin for the car

and that identifyCars returns two (2) tensors:

  • modelName – a String, the model of car
  • manufacturer – a String, the car maker

We could then execute the simple version (leaving out the optional parameters) as follows:

var tfResult = TensorFlowOperations.executeTFModelOnImage(
                "mycar.jpg",
                "identifyCars",
                { targetTensorName: "carPic" })

The more complex version where all input parameters are provided would look like this:

var tfResult = TensorFlowOperations.executeTFModelOnImage(
                "mycar.jpg",
                "identifyCars",
                { targetTensorName: "carPic",
                  inputTensors: {
                        year: { tensorType: "int", value: 1980 },
                        country: { tensorType: "string", value: "USA"}
                    }
                 })

After execution of this code snippet, tfResult will be an object whose values might be (depending on the image in question)

{
    modelName: { tensorType: "string", value: "Fusion" },
    manufacturer: { tensorType: "string", value: "Ford" }
}

In the previous example, input and output tensors (with the exception of the input image) are scalars. That is, they are simple numbers or strings. Input or output tensors can, of course, be arrays.
Note that TensorFlow tensors are always simple types (listed above), and regular, meaning that all rows in an array have the same number of columns (and, of course, extending to any number of dimensions).

To see how this might be represented, we will extend this example a little. Assume that the identifyCars also returns (in a tensor named colors) the list of colors in which the car was originally available.

Using our same calling example above, tfResult will be an object whose values might be (depending on the image in question)

{
    modelName: { tensorType: "string", value: "Fusion" },
    manufacturer: { tensorType: "string", value: "Ford" },
    colors: { tensorType: "string", value: [ "red", "black", "chartreuse", "taupe" ] }
}

Here, we see that the value returned for colors is an array of strings.

This is a somewhat contrived example of the interactions required. More commonly, a model might return a large set of numbers that requires post-processing. As noted previously, if we take a YOLO version 3 model but identify and run it as a tensorflow/plain model, things become more complicated than when executing the model as a YOLO-based model.

Just to understand what is more likely, the work performed by the YOLO model interpreter includes the following:

  • Convert the image from its native representation to a FLOAT tensor, resized to a smaller scale (typically 416 or 608 square)
    • Assuming a 416 square, this will be a set of data with dimension [1, 416, 416, 3] (1 image, with 416 rows of 416 cells of 3 colors, where each color is a floating point number (0..1) for red, green, and blue)
  • After running the model, get a FLOAT tensor with 10,647 predictions
  • From these 10,647 predictions
    • Drop those that do not pass the confidence requirement
    • Determine the “best” prediction to select the best “bounding box” for the labeled object
    • Convert the internal representation of the object to a label (data in the meta file)
    • Return this data in the form expected from a YOLO model.

There is a good deal of work performed here. When running a tensorflow/plain model, the caller will have to take the returned tensor data (already converted to VAIL form) and interpret it according to the model specification and application needs. Vantiq cannot do this work as the semantics of data interpretation are model-specific.

To analyze an image stored in a document, use the TensorFlowOperations.executeTFModelOnDocument() procedure. To analyze sensor data (sets of numbers, etc.), use the TensorFlowOperations.executeTFModelOnTensors() procedure. (These procedures are all part of the TensorFlowOperation Service.) The same style of results are returned from each.

For use in Apps, please see App tasks Run TensorFlow Model On Image, Run TensorFlow Model On Document, and Run TensorFlow Model On Tensor.

Operational Restrictions

As noted, non-YOLO-based TensorFlow models can potentially return a very large data set (the aforementioned YOLO Version 3 model, when run as a tensorflow/plain model, will return 10,647 predictions, each of which is 85 FLOATs, so, in VAIL, 85 REAL numbers). When converted, that works out to about 8 megabytes, and this is not terribly large as these things go.

Consequently, running these models (tensorflow/plain) may be controlled using resource consumption limitations. In cloud installations (by default), results returned by these models are limited by the number of “items” (single items of data regardless of the organization) and the amount of memory consumed. The limits imposed here can be determined on a per-installation basis. If the limits are exceeded, the execution is terminated and an error returned.

By default, cloud installations will set these to 1000 “items” and 1 megabyte of memory.

Edge installations, by default, have no limits imposed. Such limits can be imposed (again, the decision is made on a per-installation basis), but are not by default.

From the point of view of overall system architecture, it generally makes sense (even with YOLO models), when dealing with large objects (images, etc), to put the processing of the object as close to the source as possible. Moreover, given an application’s particular needs, a private installation can be specifically configured with the resources required. NeuralNet models tend to be very compute-intensive (generally preferring GPU processors) and memory-intensive; controlling the resource usage and allocation is more appropriately performed in a private installation.

Motion Tracking

Vantiq motion tracking interprets observations of things with locations as motion. In many cases, things with locations will come from neural net analysis, but they need not. If there is a property that contains information about location and another property that names the entity (the label property), motion tracking can track the entity.

Motion tracking allows consecutive observations of an entity’s position to be linked into a path. Additionally, we can determine the named region(s) in which a position is found, and the velocity (speed and/or direction) at which the entity is traveling. The following sections discuss these capabilities.

Concepts

Things With Locations

We have spoken of things with locations but have not formally defined them. Things with locations are entities that have a label and a location. They are structured as follows:

  • label – A property that labels the entity. Output from YOLO models will have a property named label, but other things with locations might use a different property. The name of the property can be overridden using the labelProperty parameter.
  • location – A property that contains the location information. Output from YOLO models will have a property named location, but other things with locations might use a different property. The name of the property can be overridden using the coordinateProperty parameter.

The location property must contain the following properties:

  • top – The Y value of the top of the entity’s bounding box
  • left – The X value of the left side of the entity’s bounding box
  • bottom – the Y value of the bottom of the entity’s bounding box
  • right – the X value of the right side of the entity’s bounding box
  • centerX – the X value of the center of the entity’s bounding box
  • centerY – the Y value of the center of the entity’s bounding box

This describes a rectangle where the top, left corner is specified by top and left, and where the bottom, right corner is specified by bottom, and right.
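
For example, a thing with a location (whether produced by YOLO analysis or constructed by the application) might look like the following; the coordinate values are purely illustrative:

    { label: "car",
      location: { top: 120, left: 80, bottom: 220, right: 260,
                  centerX: 170, centerY: 170 } }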

You may notice that this assumes that bounding boxes’ edges are regular with respect to those of the coordinate system. The bounding box is considered to be a rectangle with top, left and bottom, right corners specified by the properties above. That is, the required properties are designed to work with bounding boxes whose sides are parallel to the coordinate system’s “sides”.

This is not always the case. A camera could be positioned so that the image coordinates describe a bounding box that lays out at an angle in the application’s coordinate system. When that is the case, the properties above are insufficient to describe the bounding box as a polygon in that system.

When that is the case, the following properties can be provided. These are generated by the Convert Coordinates activity when required.

  • tRight – The X value of the top, right corner of the rectangle
  • rTop – The Y value of the top, right corner of the rectangle
  • bLeft – The X value of the bottom, left corner of the rectangle
  • lBottom – The Y value of the bottom, left corner of the rectangle

These are of value as we look toward more complex applications. See the Application Design section for further discussion.

Application Coordinate System

Locations of entities are specified in terms of some coordinate system. YOLO image analysis activities provide location information in terms of the image’s coordinate system, but an application using more than one image source may need a coordinate system that spans the various image sources. We refer to this as the application coordinate system. Without such a system, two separate images may report some entity (say, a car) at location (10, 20). This is the correct location with respect to the individual image(s), but these coordinates do not represent the same place in the observed world.

Using an application coordinate system is important in applications that obtain images from different places. Please see Multi-Camera Applications for further discussion.

Motion of Objects

Motion tracking in Vantiq has two steps: tracking motion (Track Motion) and assembling paths (Build and Predict Path).

In addition to these, we can predict an object’s location using the last known locations.

Track Motion

Compare the positions of the entities with those currently known. Based on the algorithm (see below), choose the best match and assign the appropriate tracking id to the entity. Assign new tracking ids to entities that have not been matched, and add them to the current set of known objects. Also assign a time of observation to the entity’s position.

Once entities are matched, check the set of known objects for objects that have been absent for too long. Drop those objects from the set of known objects.

Once complete, emit the current state from the activity (or return it from the procedure). The result has the set of tracked objects in the trackedObjects property, and the set of dropped objects in the droppedObjects property.

Parameters

Track Motion requires the following parameters.

  • state – The current set of tracked objects. A null value indicates no current state.
  • newObjects – The set of new objects with positions.
  • algorithm – Algorithm to use to determine motion.
  • qualifier – Value used to determine if two positions could be movement of the same entity.
  • maxAbsent – An interval after which an entity is considered missing. Missing objects are dropped from the set of known objects.
  • timeOfObservation – Time to assign to the observation. If unspecified, use the current time.
  • coordinateProperty – The name of the property from which to get the coordinates. The default value is location. This can be used if the input stores location information under a different property name.
  • labelProperty – The name of the property from which to extract the label. The default value is label. This can be used if the input labels things using a different property name.
  • trackingIdSourceProperty – The name of a property that is known to contain a unique value. If present, the value is used as the tracking id. If absent (or if the value for an instance is missing), the tracking id is generated.

Input data consists of things with locations, as described above.

Algorithms

Motion tracking is performed using either of two (2) algorithms: centroid or bounding box. Both algorithms maintain the set of known objects and their last positions. As sets of new positions arrive, the new positions are compared to the old, determining which entity in the new set should be considered motion of the old entities. The remainder of this section describes this in more detail.

Both algorithms limit their comparisons to things with the same label. That is, if a car and a boat appear “near” one another in successive images, it is unreasonable to determine that the car moved to a boat.

Both algorithms include the notion of a qualifier. This further qualifies the comparison for purposes of determining movement. Each algorithm’s use of the qualifier will be noted below.

Centroid Algorithm

Compares the positions of the centers of the bounding boxes of two entities. The pair with the smallest euclidean distance (having identical labels) is considered motion.

The centroid algorithm uses the qualifier to specify the maximum distance an entity can travel and still be considered “motion.” If you are tracking cars approximately once per second, it is unreasonable to expect them to move 10 miles in a single second (under normal circumstances).

The centroid algorithm operates as follows:

  • For each new entity, compare it to the set of known entities.
    • If the two entities have the same label,
      • Determine the euclidean distance between the two entities
      • Find the closest object where the distance is not greater than the maximum distance
        • Those two objects are, then, considered movement from old to new.
        • Give them the same tracking id.
    • If there is no matching object, then consider this a new object for tracking purposes.
      • Assign it a new tracking id.

Bounding Box Algorithm

Compares the overlap of the bounding boxes for two entities with identical labels. The comparison resulting in the largest percentage overlap is considered motion.

The bounding box algorithm uses the qualifier to specify the minimum percentage overlap to be considered motion. So, if the qualifier is 0.50, that specifies that the bounding box from the new position must overlap by at least 50% with the old position to be considered motion.

The bounding box algorithm operates as follows:

  • For each new entity, compare it to the set of known entities.
    • If the two entities have the same label,
      • Determine the percentage overlap of the bounding boxes for old & new entities
      • Find the object with the highest overlap percentage (whose overlap percentage qualifies)
        • Those two objects are, then, considered movement from old to new.
        • Give them the same tracking id.
    • If there is no matching object, then consider this a new object for tracking purposes.
      • Assign it a new tracking id.

In either case, the output of the track motion step is a set of tracked entities (trackedObjects), where each tracked object contains the location information, the tracking id (trackingId), and the observation time (timeOfObservation).

If the trackingIdSourceProperty value is provided and that property contains a value, the value found will be used as the tracking id. If not, a tracking id will be generated. Where a unique value is known (for example, license tag numbers or facial recognition systems), the unique id can be used to track motion across areas that are disjoint.

Build and Predict Path

This step takes the output from the track motion step and assembles paths for the tracked objects.

  • For each tracked object, find the path for that tracked object (determined by tracking id).
    • Add the new position to the end
    • If the maximum path size is exceeded, remove elements from the start of the path
  • If no matching path is found, create a new tracked path.

At the end, return the set of known entities and their paths as well as a list of objects dropped and their predicted positions (if desired).

For objects for which there is no input (that is, track motion is no longer tracking the object), (optionally) return a predicted position. This will also remove the object from the list of actively tracked paths.

You can find a detailed output example in the Build And Predict Path section of the App Builder Reference Guide.

Parameters
  • state – The current set of tracked paths. A null value indicates no current state.
  • newObjects – The set of new objects with positions.
  • maxSize – (optional) The maximum path length. Default value is 10.
  • pathProperty – (optional) Property name to use to store the path within the location object. Default is trackedPath.
  • coordinateProperty – (optional) The name of the property from which to get the location information. Default is location.
  • doPredictions – (optional) Boolean indicating whether to predict the positions of objects dropped from tracking.
  • timeOfPrediction – (optional) Time to assign to predicted positions. If unspecified, use the current time.

Predicting Locations

PredictPositions

Given a path, we can predict the next location, extrapolating from the last two known positions. To do so, we use MotionTracking.predictPositions(), which returns the list of paths with their predicted locations. A brief sketch appears after the parameter list below.

Parameters
  • pathsToPredict – The current set of tracked paths for which to predict next positions.
  • timeOfPrediction – (optional) Time to assign to predicted positions. If unspecified, use the current time.
  • pathProperty – (optional) Property name to use to store the path within the location object. Default is trackedPath.
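
As a sketch using the parameters above (the exact parameter passing should be confirmed against the MotionTracking service documentation; trackedPaths is assumed to hold paths produced by Build and Predict Path):

    // Sketch: predict the next position for each tracked path as of the current time
    var predicted = MotionTracking.predictPositions(trackedPaths, now(), "trackedPath")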

PredictPositionsBasedOnAge

We can also selectively predict positions based upon age (that is, the time we last saw some object). To do so, we use MotionTracking.predictPositionsBasedOnAge(). This procedure evaluates the candidatePaths against the expirationTime. For any paths whose last timeOfObservation is at or before the expirationTime, we predict the next position (based on timeOfPrediction) and return that list. Paths whose last observation is after the expirationTime are ignored.

Parameters
  • candidatePaths – The current set of tracked paths for which to predict next positions.
  • expirationTime – The time representing the latest time considered expired.
  • timeOfPrediction – (optional) Time to assign to predicted positions. If unspecified, use the current time.
  • pathProperty – (optional) Property name to use to store the path within the location object. Default is trackedPath.

Tracking Regions

Tracking Regions provide the ability to name regions within the coordinate system used by applications in a namespace. A detailed reference can be found in the Tracking Regions section of the Resource Reference Guide. Tracking regions have the following properties:

  • name – the name of the tracking region.
  • boundary – an Object containing a list of (at most 4) points that comprise the boundary of the region.
  • distance – an Object containing the following properties:
    • points – a list of two points
    • distance – the distance between these two points
  • direction – an Object containing the following properties
    • points – a list of two points
    • direction – the direction (compass degrees between 0 and 360) that movement from the first point to the second represents.

In each of the above, specify points with an x and y component (alternatively, you can use lon and lat, respectively). Again, for details, please see the Tracking Regions section.
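
As an illustration only (the values here are invented, and the exact layout should be confirmed against the Tracking Regions section of the Resource Reference Guide), a rectangular tracking region might be defined with values such as these:

    // Sketch: a rectangular region with distance and direction reference points
    var mainStreetRegion = {
        name: "mainStreet",
        boundary: [ { x: 0, y: 0 }, { x: 200, y: 0 }, { x: 200, y: 80 }, { x: 0, y: 80 } ],
        distance: { points: [ { x: 0, y: 0 }, { x: 200, y: 0 } ], distance: 50 },
        direction: { points: [ { x: 0, y: 0 }, { x: 200, y: 0 } ], direction: 90 }
    }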

Tracking regions need not be mutually exclusive. That is, some particular location may be found in many different tracking regions (often shortened to regions) or none. For example, if we consider a traffic intersection, a single location could, quite reasonably, be contained in all of the following regions:

  • The intersection
  • The cross street
  • The crosswalk

All regions in a namespace are expected to be in the same coordinate system. A coordinate system here refers to a consistent set of coordinates that provide the location information. See the Application Coordinate System and Multi-Camera Applications for further discussion.

By default, the set of regions in a namespace comprises the region search space for all applications in that namespace. That said, it is possible, within a single namespace, to have sets of tracking regions that the application considers disjoint. For example, consider a set of cameras that track the motion of objects in a set of completely separate buildings. Any given camera is known to belong to a particular building. In such an environment, it may not be desirable to create a coordinate system that maps all buildings separately within the coordinate space. Instead, we may consider a set of regions for each building or set of buildings – these buildings or sets being thought of as having overlapping coordinate systems. When choosing to do this, it is important to ensure that the sets in use are completely disjoint.

To do this, two things are necessary. First, each such set of regions must be named in such a way that the specific set can be determined. Second, each such set of regions must include the distance and direction properties if velocity is expected to be determined.

The region search space can be restricted through the use of the trackingRegionFilter property on the BuildAndPredictPath and PredictPathsByAge activity patterns when building apps. When using VAIL code to find regions, the list of regions to consider is passed to MotionTracking.findRegionsForCoordinate().

If an application is making use of the ability to use the same coordinate space for what it considers disjoint sets of tracking regions, the application is responsible for ensuring that these disjoint sets are always used consistently, and that the distance and direction properties are consistent and properly present in each such set where velocity information is expected.

An entity’s location is determined to be in a particular region if that entity’s bounding box’s centroid is located within or on the border of a region (i.e. it is not outside the region). Note that this means that the entire bounding box need not be contained within the region.

As an example, consider the following image.

Traffic Image

To our application, this image may have a number of named areas of interest.

In the image below, we can imagine (and roughly draw) regions defining

  • the intersection (white),
  • the main street (green), and
  • the bike path (magenta).

Annotated Traffic Image

If we look at some of the objects contained in the image below, we see the objects labeled and color-coded by region. One bicycle with person (boxed in magenta) is in the bike path region, the car (white) is in the intersection and main street regions, and the van (green, labeled ‘car’ – the model does not know about vans) is in the main street region. The second bicycle (with person) is not in any region.

boxed Traffic Image

In addition to named areas, regions provide the basis for determining velocity information about moving entities. Regions can provide direction and distance information. The direction component of a region says that the two points specified define a compass direction. That is, anything “traveling” from the first point to the second is traveling in the listed compass direction (0 is north, 90 is east, etc.). From this direction, we can determine the direction of movement for any pair of points.

Similarly, the distance component of a region specifies the distance between the two points. With this information, we can determine the distance between any two points. The units (feet, kilometers, parsecs, etc.) are assumed to be known to the application(s) making use of the regions.

Because the set of regions shares a coordinate system, we require only one instance of distance and direction in a set of regions. If more than one exists, which one is used is not defined. (See Application Design for more information.)

As part of the path information provided, a velocity object is reported. The velocity property contains a direction property and a distance property. If no region provides distance or direction information, an empty velocity object is reported. If only direction information is provided, only the direction is reported. Similarly, if the set of regions provides only distance information, only the distance is reported.

Application Design

As you build applications involving motion tracking, please consider the following.

Model Capabilities

As noted above, the motion tracking algorithms use the entity label to determine which objects are potentially instances of motion. Use of a discriminating model (a model that has fine-grained object identification) will help motion tracking to reduce the search space. Consider a camera overlooking a highway. If the model says that it sees cars, there are likely to be a great many cars, meaning that all the car positions will have to be compared.

If, on the other hand, your model reports cars with colors, models, or both (e.g. silver car, DeLorean car, or silver DeLorean car), then the search space for particular labels will be smaller.

Motion Tracking Algorithms

As noted in Algorithms, there are two ways to determine motion. You may have to experiment to determine which is preferable for your application. If all the cameras have a similar view (i.e. all cameras are 10 meters up and have a similar angle of view), then the bounding box algorithm may be applicable. If, however, different cameras have a different view (say one from overhead, one from 3 meters up), their notion of the bounding boxes may be different, and the centroid algorithm may be more appropriate.

Scaling Motion Tracking Applications

Generally, motion tracking does not involve database activity. The region list will be read on first use and updated rarely in a production application (the update time is based on the refreshInterval of the BuildAndPredictPath activity). However, memory storage is required for the set of tracked objects and their paths. The size here depends upon the maximumPathSize property, and, of course, the number of tracked objects. Also, updates require traversing the set of known objects. Consequently, it is important to manage this state.

Within Vantiq applications, the workload is often divided using Unwind and Split By Group activities. To use these with motion tracking applications, please keep the following in mind. First, as noted in the Algorithms section, labels are used to refine the search space. Consequently, it is usually important that all entities with the same label are processed by the same motion tracking components. Since we do not know a priori where an object will be going, we have to take care when using any location information to split the workload. If, say, we used some part of the location to split the workload, something moving from one split group to another would no longer be tracked as motion, since the state would not be shared. Some applications may be able to use that approach, based upon their knowledge of entity movement, but that is not generally true.

Similarly, care should be taken with the Unwind task. The algorithms make use of the grouping of things, generally assuming that a set of things processed at the same time come from the same image. Consequently, they use relative comparisons to determine the best match. “Closest” (i.e. shortest euclidean distance or greatest bounding box overlap) is comparative with other values from the same image. When working with the output of YOLO activities (YOLO From Images or YOLO From Documents), the preferred mechanism is to use the groupByLabel boolean for those activity patterns. This will emit a set of groups (each group having common label content), and split based on that label.

Multi-Camera Applications

Using more than one camera can present challenges to the application designer. If the cameras are disjoint (no entity will reasonably travel from one camera to another), then there are few issues to deal with.
If, on the other hand, the cameras are observing a space where entities may move from camera to camera, then the application must be designed with this in mind.

To aid such an application, you should define an application coordinate system.

Consider a set of cameras observing some road. Each camera provides images, and the YOLO image analysis will find the entities within those images, reporting their locations based on image coordinates.
If each camera finds a car at location (10, 10), those are not really located at the same place – since images from camera 1 are, say, 50 meters down the road from those of camera 2. In this environment, convert the entity locations to the application coordinate system to correctly locate these cars within the physical space observed by the application.

This function can be performed by the Convert Coordinates activity. In this case, to design the application, develop an application coordinate system, and specify the conversion of each camera’s image-based coordinates to that coordinate system.

Design of the application coordinate system is beyond the scope of this document. However, you should consider that the coordinates used should cover the observed area, taking into account any dead areas (areas not observed by any camera). GPS coordinates are a possibility.

Once the coordinate system is designed, a Convert Coordinates activity should be included to convert the image coordinates provided by, say, the YOLO-based analysis activities into the application coordinate system. Once entities are located in the application coordinate system, motion tracking can occur in that space. Since locations are now in the global space, entities that move from one camera to another can be appropriately tracked.

Note: if there are gaps in camera coverage (say between cameras 1 & 2), an object moving from camera 1 to camera 2 may be missing for a short while. Here, missing means that it does not appear in any image.
The Track Motion’s maxAbsentBeforeMissing time should be long enough to allow entities to move from camera to camera.

Note also that the considerations above regarding a model’s level of detail still apply.

Taking these elements into consideration, we might design a 3-camera application as follows.

Multi Camera Application

In this application, we see 3 event streams (Camera1, Camera2, and Camera3) that handle the arrival of images from the three cameras. For each stream, locate the entities within their respective images, each YOLO activity having the groupByLabel box checked.

That completed, convert each stream’s image coordinates into the application coordinates (using the Convert Coordinates activities CC1, CC2, and CC3). Once the application coordinates are available, split the work (by label) and perform motion tracking.

In this application, define the TrackingRegions using the application coordinate system. This provides velocity and regions based on that system.

References


TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

TensorFlow

The tus protocol specification

YOLO Research