Image Processing Reference Guide
Image processing in Vantiq allows applications to make use of images as another type of sensor input. Just as various sensors on equipment can report their temperature or other facts that let an application monitor their ongoing state, images can be used to gather information where physical state sensors are not appropriate or available.
Consider a town attempting to determine if some area (walkway, street, etc.) is occupied. While it may be possible to instrument every sidewalk with pressure sensors every few inches, it is much more efficient to have cameras provide periodic images of the areas in question. Also, consider a factory setting where there may be security or worker health considerations for people in designated areas. Information provided by cameras in these areas can be used to determine if and what situations may exist and who should be notified.
Image information, however, does not directly provide the facts that are needed by an application. Instead, the images need to be analyzed to produce the information relevant to the application.
Generally speaking, this analysis is performed by a neural net. A neural net (more formally, a neural network) is a set of algorithms modeled loosely on the human brain. Neural nets interpret data by recognizing patterns, and a particular neural net model is trained to interpret those patterns for a specific purpose, such as object recognition or face recognition. The neural net model is, effectively, part of the application.
To build an application utilizing image processing, there are some general capabilities that are necessary. The application needs to be able to acquire, manipulate, and analyze images. The remainder of this document describes these capabilities within the Vantiq system.
Overview
The Vantiq system has a set of resources and services that are relevant to this area.
- Images provide a Vantiq resource in which images are stored.
- Images can be manipulated using the VisionScriptBuilder and VisionScriptOperation services.
- TensorFlowModels provide a Vantiq resource in which TensorFlow models are stored for use in the application.
- Images can be analyzed using the TensorFlowOperation service.
- Note: Images that are stored as Documents can be analyzed as well.
The following sections describe these in more detail.
Image Acquisition
Image acquisition in the Vantiq system may be accomplished in a variety of ways. External programs can load images as described in the Images section of the Resource Guide.
Vantiq also offers an Enterprise Connector called the Object Recognition Connector that can be used to monitor a camera, fetch the image, and supply it to the system.
Image Manipulation
As images appear in the system, it is often desirable to manipulate the images. They may need to be resized, converted to black & white, labeled, or entities within the image identified. To perform these operations, the Vantiq system uses the VisionScriptOperation service.
Vision Script
A Vision Script consists of a series of actions that are performed on an image. The available actions (convert to grayscale, crop, describe, draw boxes, draw text, find faces, resize, and save) are described in the sections that follow.
The vision script, via the VisionScriptOperation service, processes a specific image. The image in question becomes the working image, and subsequent actions can get information about the current working image or change it. Each action is applied to the image, and the results of that action are available either as the working image or as a result object.
The object returned from the service has a property for each action that provides a result, where the property name is the action’s tag property or, if that is missing, the action’s name.
Vision Script Structure
Vision scripts in VAIL can be constructed using the VisionScriptBuilder service. Alternatively, they can be constructed directly. A Vision Script object consists of the following properties.
- scriptName – Optional name of the script. Primarily used for debugging.
- script – Array of actions that comprise the script
Each action consists of the following properties.
- name – String that identifies the action to be performed. The actions and their names are described in more detail in the subsequent sections.
- tag – Optional String containing an identity for the action. This is used to reference that action’s results. If the tag is not provided, the action’s results can be identified by its name.
- parameters – Object containing the specific parameters for this particular action. The parameters for each action are described with the individual action.
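For example, a minimal Vision Script object built directly in VAIL, following the structure above, might look like this (the script name, action choices, and tag value are illustrative):

var myScript = {
    scriptName: "ExampleScript",
    script: [
        // First action: convert the working image to grayscale (no parameters)
        { name: "convertToGrayscale" },
        // Second action: describe the working image; its results will appear
        // in the returned object under the tag "imageInfo"
        { name: "describe", tag: "imageInfo" }
    ]
}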
Convert to Grayscale Action
This action converts an image to grayscale or black & white.
- name – convertToGrayscale
- parameters – None
The result of this action is that the working image is converted to grayscale.
Crop Action
This action replaces the current working image with the part of the working image identified by the parameters provided. The portion of the image to which to crop is identified by the upper left corner, and a height & width.
- name – crop
- parameters
  - x – Integer x coordinate of the top, left corner of the portion to which to crop
  - y – Integer y coordinate of the top, left corner of the portion to which to crop
  - width – Integer width of the area to which to crop
  - height – Integer height of the area to which to crop
The result of this action is that the working image is replaced by the cropped portion.
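For illustration, a crop action that crops to a 200 x 150 region whose top, left corner is at (50, 40) might be constructed as follows (the coordinate values are illustrative):

// Crop the working image to the 200 x 150 region starting at (50, 40)
var cropAction = {
    name: "crop",
    parameters: { x: 50, y: 40, width: 200, height: 150 }
}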
Describe Action
This action provides information about the current working image. The information provided includes whether the image is empty, the size (height & width) of the image, and the number of channels (3 or more for color, 1 for grayscale).
- name – describe
- parameters – None
The result of this action is an entry in the results with the name of the tag (if present) or the name describe, containing the data outlined above.
Draw Boxes Action
This action draws boxes on the working image. This is often used to call attention to objects identified by the analysis.
- name – drawBoxes
- parameters
  - boxList – List of boxes to draw on the working image. Each box is described as follows.
    - x – Integer X coordinate of upper left corner of box
    - y – Integer Y coordinate of upper left corner of box
    - width – Integer width of box, measured (right) from X coordinate
    - height – Integer height of box, measured (down) from Y coordinate
    - thickness – Optional integer thickness of the box boundary. If not present, defaults to 2.
    - color – Optional object containing 3 Integers (values 0-255): red, green, and blue. For example, { red: 128, green: 128, blue: 128 }. If not present, defaults to red.
    - label – Optional String with which to label the box
    - font – Optional String identifying font to use (see Draw Text Action for details)
    - isItalic – Optional Boolean indicating whether to italicize text (if possible)
The result of this action is that the working image is replaced by one with the indicated boxes added.
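As a sketch, a drawBoxes action that draws a single labeled box might be built like this (the coordinates, label, and color are illustrative):

// Draw one 100 x 80 box at (120, 60), labeled "person", with a gray border
var drawBoxesAction = {
    name: "drawBoxes",
    parameters: {
        boxList: [
            {
                x: 120, y: 60, width: 100, height: 80,
                thickness: 3,
                color: { red: 128, green: 128, blue: 128 },
                label: "person",
                font: "HERSHEY_PLAIN"
            }
        ]
    }
}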
Draw Boxes Using Previous Results Action
This action draws boxes on the working image, obtaining the list of boxes from a previous action. This is often used to call attention to objects identified by the analysis.
- name – drawBoxes
- parameters
  - useResultsFrom – String name of the previous action from which to obtain the box list. The name used here is either the tag from the previous action, or, if no tag was used, the name of the previous action.
The result of this action is that the working image is replaced by one with the indicated boxes added.
Draw Text Action
This action puts a text message on the working image.
- name – drawText
- parameters
  - x – Integer x coordinate for start of text
  - y – Integer y coordinate for start of text
  - font – String naming the font to use for the text. Font choices are controlled by the underlying platform, and are as follows:
    - HERSHEY_PLAIN
    - HERSHEY_COMPLEX
    - HERSHEY_TRIPLEX (Italics ignored)
    - HERSHEY_SIMPLEX
    - HERSHEY_DUPLEX
    - HERSHEY_COMPLEX_SMALL
    - HERSHEY_SCRIPT_SIMPLEX
    - HERSHEY_SCRIPT_COMPLEX
  - isItalic – Boolean indicating whether to italicize the text (if possible)
  - thickness – Integer thickness in pixels of the text
  - fontScale – Real number by which to scale the text
  - color – Object containing 3 Integers (values 0-255): red, green, and blue. For example, { red: 128, green: 128, blue: 128 }. If not present, defaults to red ({ red: 255, green: 0, blue: 0 }).
The result of this action is that the working image is replaced by one with the indicated text added.
Find Faces Action
This action locates faces in an image.
- name – findFaces
- parameters – None
The results of this action are placed in the service result, identified by the tag, or name if no tag was provided. For each face found, the action result will contain an array entry describing a box outlining the face found. The box description consists of the x and y coordinates (top, left corner), and height and width.
Resize Action
This action replaces the current working image with the same image downsized to the size requested.
- name – resize
- parameters
  - width – Integer width to which to resize. Width must be less than or equal to the width of the current working image.
  - height – Integer height to which to resize. Height must be less than or equal to the height of the current working image.
The result of this action is that the working image is replaced by the resized image.
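For example, a resize action that downsizes the working image to 416 x 416 (assuming the working image is at least that large) would look like this:

// Downsize the working image to 416 x 416
var resizeAction = {
    name: "resize",
    parameters: { width: 416, height: 416 }
}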
Save Action
This action saves the current working image as an instance of the Images resource.
- name – save
- parameters
  - saveName – Optional String providing the name to which to save the image. If not provided, the image on which this script is operating will be overwritten.
  - fileType – Optional String providing the MIME type to be used (e.g., image/png). If not provided, the fileType of the image on which this script is operating will be used.
The result of this action is that the current working image is written as an instance of Images. A script that does not include a save action will result in no changes being made to the image. The result object returned will also contain the image saved (identified by the tag if provided, or action name if not). The returned object will include the name and fileType of the image saved.
Using Vision Script
As an example, we will create a procedure that runs a VisionScript. This script will
- convert the image to grayscale,
- save that image,
- find faces in that image,
- draw boxes around the faces, and
- save the image with boxes
For our example, we will run these actions over a sample image.
The procedure is as follows.
PROCEDURE vsExample(imageName String)
// Create the script
var script = { scriptName: "FindingFaces"}
// Build a convertToGrayscale action
var convertAction = { name: "convertToGrayscale"}
// Save that image to an image named gsSave.jpg
var saveGrayScaleAction = { name: "save"}
var saveGSParams = { saveName: "gsSave.jpg", fileType: "image/jpeg"}
saveGrayScaleAction.parameters = saveGSParams
// Find the faces in the image & draw boxes
var ffAction = { name: "findFaces", tag: "locateFaces" }
var drawBoxesAction = { name: "drawBoxes"}
// Use results from the findFaces action to draw boxes on our image
var dbParams = { useResultsFrom: "locateFaces"}
drawBoxesAction.parameters = dbParams
// Save the resulting image to boxedSave.jpg
var saveBoxedAction = {name: "save", tag: "boxedSave"}
var saveBoxedParams = { saveName: "boxedSave.jpg", fileType: "image/jpeg"}
saveBoxedAction.parameters = saveBoxedParams
script.script = [convertAction,
saveGrayScaleAction,
ffAction,
drawBoxesAction,
saveBoxedAction]
var result = VisionScriptOperation.processImage(imageName, script)
return result
When we run this procedure, we will see results that look approximately like this.
{
"locateFaces": [
{
"height": 57,
"width": 57,
"x": 224,
"y": 16
},
{
"height": 60,
"width": 60,
"x": 75,
"y": 47
}
],
"save": {
"name": "gsSave.jpg",
"fileType": "image/jpeg",
"contentSize": 77543,
"ars_modifiedAt": "2019-07-15T20:47:44.189Z",
"ars_modifiedBy": "fhc",
"content": "/pics/gsSave.jpg"
},
"boxedSave": {
"name": "boxedSave.jpg",
"fileType": "image/jpeg",
"contentSize": 78369,
"ars_modifiedAt": "2019-07-15T20:47:44.203Z",
"ars_modifiedBy": "fhc",
"content": "/pics/boxedSave.jpg"
}
}
There are a few things to note about the results here. First, we see the results of the findFaces action. Since we provided a tag for that action, the results are identified by the tag value. Note that the vision script identified the previous action results for the drawBoxes action using the tag value locateFaces.
This vision script object has two (2) save actions in it, one with a tag and one without. We see the results of both actions – one identified by the tag value boxedSave, and the other identified by its action name save since no tag was provided. If no tag were provided for either save action, the results would contain only the results from the last save action. This is because only one result with a given name can be returned.
The two saved images produced are the simple grayscale version and the image with faces marked.
In this case, since the image is grayscale, the color of the boxes is black (rather than the default color of red).
Image Analysis
The Vantiq system provides the means to employ specific neural net models to perform analysis of complex data as part of an application. Specifically, the Vantiq system provides the ability to run TensorFlow models. TensorFlow models can be used to process data that is available to the Vantiq system.
TensorFlow models must be provided to the system, as described here.
YOLO-based TensorFlow Models
For detection of objects within an image, Vantiq provides support for YOLO-based TensorFlow models. These are designed to efficiently process an image, identifying entities within that image based on the model’s training.
YOLO-based TensorFlow models are identified by the modelType of tensorflow/yolo. (See the Resource Reference Guide for more information.)
Vantiq uses a TensorFlow implementation of the YOLO (You Only Look Once) type of neural net. As the name suggests, this style of operation scans the image, looking at each area only once. Currently, Vantiq supports YOLO Version 2 and Version 3 models. YOLO Version 3 is reported to be more accurate, mostly for smaller objects, but we make no recommendation as to your choice of versions.
Objects that are identified by the model are returned with the following information:
- label – the type of object identified
- confidence – specifying on a scale of 0-1 how confident the neural net is that the identification is accurate
- location – containing the coordinates for the top, left, bottom, and right edges of the bounding box for the object
Preparation of YOLO Models
Construction or acquisition of these models must be done outside of Vantiq, and is beyond the scope of this document. Information about model construction and translation to TensorFlow can be found at the following locations:
- darknet – Information about building the model
- darkflow – Translation from YOLO v2 to TensorFlow format.
Unfortunately, translation of YOLO v3 models to TensorFlow is not as straightforward. The darkflow system listed above does not yet support version 3. We have found the following mechanism(s) to provide reasonable results.
- Use darkflow to generate the meta file.
- (Alternately, construct the meta file manually providing the information Vantiq needs.)
- Then, use tensorflow-yolov3 to produce the .pb file from the darknet .weights file.
Other mechanisms may also work; this one is known to work in our system.
Note that the use of any of these mechanisms requires that the version of TensorFlow used to generate the model is compatible with (generally, less than or equal to) that used by Vantiq. The version used by Vantiq is available using the Resource.buildInfo() service on the system.tensorflowmodels resource type. This is described in the VAIL Rule and Procedure Reference Guide.
Describing the Model
As outlined here, a tensorflowmodel requires a model file (implemented via a ProtoBuf or .pb file). This is a specification of the model graph used by TensorFlow.
Additionally, for YOLO models, we require a meta file (.meta file). The meta file contains data encoded as JSON, describing the training and interpretation of the model. While there is other information in the meta file, Vantiq makes use of the following.
- net: height and width – these contain the height & width of the expected input image.
  - Current YOLO implementations require identical values that are evenly divisible by 32.
- num – this is the number of anchor boxes used in the model.
  - A value of 5 indicates a YOLO version 2 model.
  - A value of 9 indicates a YOLO version 3 model.
  - If absent, it is calculated from the anchors property; if present, it is expected to match the number of anchor boxes provided.
- anchors – the list of anchor boxes.
  - A list of pairs of sizes representing the sizes of anchor boxes.
  - Anchor boxes are generated during model training and are used in determining each object’s bounding box.
  - As noted, anchors is a list of pairs, so the number of anchor boxes is half of the size of this list.
- labels – an ordered list of names of the objects found.
  - Running the model returns specific objects found only as a number; the labels property allows the objects found to be named with user-provided names rather than just numbers. The order of the labels in the list is important, as the labels list is referenced by the object index returned by the model to produce the object name.
At runtime, each model has a set of input and output operations. Generally, Vantiq can determine these from the model directly, but there may be cases where that fails. To provide for such cases, we support the following extension to the meta file.
- vantiq – Vantiq extension used to provide information
  - inputOperations – a list of the names of the input operations
  - outputOperations – a list of the names of the output operations
Vantiq will provide information to and extract information from the model using these operation names if provided. Otherwise, the operation names determined directly from the model are used.
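As an illustration, a meta file for a hypothetical two-class YOLO Version 2 model, showing only the properties Vantiq uses plus the optional vantiq extension, might look like the following. The anchor values, labels, and operation names are illustrative, not taken from any real model.

{
    "net": { "height": 416, "width": 416 },
    "num": 5,
    "anchors": [0.57, 0.67, 1.87, 2.06, 3.34, 5.47, 7.88, 3.53, 9.77, 9.17],
    "labels": ["person", "car"],
    "vantiq": {
        "inputOperations": ["input"],
        "outputOperations": ["output"]
    }
}

Note that the anchors list here has ten values (five pairs), which is consistent with num being 5 (a YOLO Version 2 model).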
Using YOLO Models
To analyze an image using such a model, use the TensorFlowOperation service. This is done as follows.
Assume that we have an image called targetImage.jpg, and a model, myModel, with which to analyze it. Further, assume that we want objects identified only if the model’s confidence is at least 75%.
To perform the analysis, run the following VAIL code.
var yoloResults = TensorFlowOperation.processImage("targetImage.jpg", "myModel", 0.75)
Assuming that our image contained a car and a person, we might get a result back that looks like the following.
{
{ confidence:0.79194605,
location:[top:259.94803, left:622.9274, bottom:477.97113, right:897.4523],
label:car
},
{
confidence:0.8238598,
location:[top:294.93753, left:342.35565, bottom:421.78534, right:404.92627],
label:person
}
}
To analyze an image stored in a document, use the TensorFlowOperations.processDocument() procedure. The same style of results is returned.
For use in Apps, please see App tasks YOLO From Images and YOLO From Documents. The ConvertCoordinates task may also be of interest.
“Plain” TensorFlow Models
When presented with a YOLO-based TensorFlow model, Vantiq understands the context and organization of the model. As such, it can process the model output, returning data in a manner optimized for the model’s purpose. Running a YOLO version 3 model returns over 10,000 predictions; Vantiq understands the structure of YOLO model output and does the work to remove duplicates and predictions that are below the required confidence.
In the more general case, Vantiq can run the model, but it cannot pre- or post-process the model’s output. Consequently, users of these models must be prepared to interpret the output of the model.
Generally, these models may produce a large volume of output, so applications must be prepared to perform the appropriate analysis.
Moreover, the interaction with these models requires a deeper understanding of the input and output needs of the model in question. Developers using models of type tensorflow/plain are assumed to have an understanding of their model and TensorFlow in general.
Using These Models
At a high level, the running of a TensorFlow model involves the execution of a number of TensorFlow operations. These operations are used to analyze input and produce the results as output. Input and output are delivered to and from named operations through the use of tensors. (This should not be interpreted as a deep treatise on TensorFlow; we are merely providing enough terminology to understand the interface required.)
These tensors have a type and value. The dimension of a tensor (whether it is a scalar or a multidimensional array) is determined at runtime, but must match the expectation of the model. A model’s users are expected to understand the input and output requirements for executing that model.
TensorFlow’s tensor types are more specific than the type system used by VAIL. The Vantiq runtime system will adapt accordingly, so callers need only be aware of the type compatibility. The set of TensorFlow tensor types includes FLOAT, DOUBLE, INT32, INT64, BOOL, and STRING. Generally, a STRING to TensorFlow is a byte array.
To provide data to TensorFlow, it is best to come as close as possible to the TensorFlow type. That is, to provide data to a FLOAT or DOUBLE tensor, it is best to use a VAIL REAL; to provide data to an INT32 or INT64 tensor, the use of an INTEGER is preferable. Any underlying number type will work, but there may be more work involved and less precision.
Similarly, values returned from TensorFlow will use the most appropriate VAIL types: FLOAT or DOUBLE to VAIL REAL, INT32 or INT64 to VAIL INTEGER. Objects passed into or returned from TensorFlow will be objects of the form
{ tensorType: <one of the tensor types above>, value: <VAIL value> }
These will be converted to or from tensors as required.
To analyze an image using such a TensorFlow model, use the TensorFlowOperation Service. The specific calls used to run tensorflow/plain models are described here.
Using the information about input and output tensors here, we can see that calls might be done as follows. (This example also appears in the TensorFlowOperation Service description.)
For a simple example, assume we wish to process an image named mycar.jpg using a model named identifyCars. Further assume identifyCars supports three (3) input tensors:
- ‘carPic’ – the image to analyze
- ‘year’ – (optional) the year in which the car was manufactured
- ‘country’ – (optional) the country of origin for the car.
and that identifyCars returns two (2) tensors:
- modelName – a String, the model of car
- manufacturer – a String, the car maker
We could then execute the simple version (leaving out the optional parameters) as follows:
var tfResult = TensorFlowOperations.executeTFModelOnImage(
"mycar.jpg",
"identifyCars",
{ targetTensorName: "carPic" })
The more complex version where all input parameters are provided would look like this:
var tfResult = TensorFlowOperations.executeTFModelOnImage(
"mycar.jpg",
"identifyCars",
{ targetTensorName: "carPic",
inputTensors: {
year: { tensorType: "int", value: 1980 },
country: { tensorType: "string", value: "USA"}
}
})
After execution of this code snippet, tfResult will be an object whose values might be (depending on the image in question)
{
modelName: { tensorType: "string", value: "Fusion" },
manufacturer: { tensorType: "string", value: "Ford" }
}
In the previous example, input and output tensors (with the exception of the input image) are scalars. That is, they are simple numbers or strings. Input or output tensors can, of course, be arrays.
Note that TensorFlow tensors are always simple types (listed above), and regular, meaning that all rows in an array have the same number of columns (and, of course, extending to any number of dimensions).
To see how this might be represented, we will extend this example a little. Assume that identifyCars also returns (in a tensor named colors) the list of colors in which the car was originally available.
Using our same calling example above, tfResult will be an object whose values might be (depending on the image in question)
{
modelName: { tensorType: "string", value: "Fusion" },
manufacturer: { tensorType: "string", value: "Ford" },
colors: { tensorType: "string", value: [ "red", "black", "chartreuse", "taupe" ] }
}
Here, we see that the colors returned is an array of strings.
This is a somewhat contrived example of the interactions required. More commonly, a model might return a large set of numbers that requires post-processing. As noted previously, if we consider a YOLO version 3 model but identify and run it as a tensorflow/plain model, things become more complicated than when executing the model as a YOLO-based model.
To give a sense of what is involved, the work performed by the YOLO model interpreter includes the following:
- Convert the image from its native representation to a FLOAT tensor, resized to a smaller scale (typically 416 or 608 square)
  - Assuming a 416 square, this will be a set of data with dimension [1, 416, 416, 3] (1 image, with 416 rows of 416 cells of 3 colors, where each color is a floating point number (0..1) for red, green, and blue)
- After running the model, get a FLOAT tensor with 10,647 predictions
- From these 10,647 predictions:
  - Drop those that do not pass the confidence requirement
  - Determine the “best” prediction, selecting the best bounding box for each labeled object
  - Convert the internal representation of the object to a label (using data in the meta file)
- Return this data in the form expected from a YOLO model.
There is a good deal of work performed here. When running a tensorflow/plain model, the caller will have to take the returned tensor data (already converted to VAIL form) and interpret it according to the model specification and application needs. Vantiq cannot do this work as the semantics of data interpretation are model-specific.
To analyze an image stored in a document, use the TensorFlowOperations.executeTFModelOnDocument() procedure. To analyze sensor data (sets of numbers, etc.), use the TensorFlowOperations.executeTFModelOnTensors() procedure. (These procedures are all part of the TensorFlowOperation Service.) The same style of results is returned from each.
For use in Apps, please see App tasks Run TensorFlow Model On Image, Run TensorFlow Model On Document, and Run TensorFlow Model On Tensor.
Operational Restrictions
As noted, non-YOLO-based TensorFlow models can potentially return a very large data set (the aforementioned YOLO Version 3 model, when run as a tensorflow/plain model, will return 10,647 predictions, each of which is 85 FLOATs – so, in VAIL, 85 REAL numbers). When converted, that works out to roughly 905,000 REAL values, or about 8 megabytes. And this is not terribly large as these things go.
Consequently, running these models (tensorflow/plain) may be controlled using resource consumption limitations. In cloud installations (by default), results returned by these models are limited by the number of “items” (single items of data regardless of the organization) and the amount of memory consumed. The limits imposed here can be determined on a per-installation basis. If the limits are exceeded, the execution is terminated and an error returned.
By default, cloud installations will set these to 1000 “items” and 1 megabyte of memory.
Edge installations, by default, have no limits imposed. Such limits can be imposed (again, the decision is made on a per-installation basis), but are not by default.
From the point of view of overall system architecture, it generally makes sense (even with YOLO models), when dealing with large objects (images, etc.), to put the processing of the object as close to the source as possible. Moreover, given an application’s particular needs, a private installation can be specifically configured with the resources required. Neural net models tend to be very compute intensive (generally preferring GPU processors) and memory intensive; controlling the resource usage and allocation is more appropriately performed in a private installation.
Motion Tracking
Vantiq motion tracking interprets observations of things with locations as motion. In many cases, things with locations will come from neural net analysis, but they need not. If there is a property that contains information about location and another property that names the entity (the label property), motion tracking can track the entity.
Motion tracking allows consecutive observations of an entity’s position to be linked into a path. Additionally, we can determine the named region(s) in which a position is found, and the velocity (speed and/or direction) at which the entity is traveling. The following sections discuss these capabilities.
Concepts
Things With Locations
We have spoken of things with locations but have not formally defined them. Things with locations are entities that have a label and a location. They are structured as follows:
- label – A property that labels the entity. Output from YOLO models will have a property named label, but other things with locations might use a different property. The name of the property can be overridden using the labelProperty parameter.
- location – A property that contains the location information. Output from YOLO models will have a property named location, but other things with locations might use a different property. The name of the property can be overridden using the coordinateProperty parameter.
The location property must contain the following properties:
- top – The Y value of the top of the entity’s bounding box
- left – The X value of the left side of the entity’s bounding box
- bottom – The Y value of the bottom of the entity’s bounding box
- right – The X value of the right side of the entity’s bounding box
- centerX – The X value of the center of the entity’s bounding box
- centerY – The Y value of the center of the entity’s bounding box
This describes a rectangle where the top, left corner is specified by top and left, and where the bottom, right corner is specified by bottom and right.
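For example, a thing with a location describing a detected car (using coordinates similar to the YOLO example earlier, rounded, with the center values added) might look like:

{
    label: "car",
    location: {
        top: 260, left: 623, bottom: 478, right: 897,
        centerX: 760, centerY: 369
    }
}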
You may notice that this assumes that the bounding boxes’ edges are regular with respect to those of the coordinate system. The bounding box is considered to be a rectangle with top, left and bottom, right corners specified from the properties above. That is, the required properties are designed to work with bounding boxes whose sides are parallel to the coordinate system’s “sides”.
This is not always the case. A camera could be positioned so that the image coordinates describe a bounding box that lays out at an angle in the application’s coordinate system. When that is the case, the properties above are insufficient to describe the bounding box as a polygon in that system.
When that is the case, the following properties can be provided. These are generated by the Convert Coordinates activity when required.
- tRight – The X value of the top, right corner of the rectangle
- rTop – The Y value of the top, right corner of the rectangle
- bLeft – The X value of the bottom, left corner of the rectangle
- lBottom – The Y value of the bottom, left corner of the rectangle
These are of value as we look toward more complex applications. See the Application Design section for further discussion.
Application Coordinate System
Locations of entities are specified in terms of some coordinate system. YOLO image analysis activities provide location information in terms of the image’s coordinate system, but an application using more than one image source may need a coordinate system that spans the various image sources. We refer to this as the application coordinate system. Without such a system, two separate images may report some entity (say, a car) at location (10, 20). This is the correct location with respect to the individual image(s), but these coordinates do not represent the same place in the observed world.
Using an application coordinate system is important in applications that obtain images from different places. Please see Multi-Camera Applications for further discussion.
Motion of Objects
Motion tracking in Vantiq has two steps:
- Track Motion (via the Track Motion activity pattern or the MotionTracking.trackMotion() service), and
- Build Path (via the Build and Predict Path activity pattern or the MotionTracking.buildAndPredictPath() service).
In addition to these, we can predict an object’s location using the last known locations.
Track Motion
Compare the positions of the entities with those currently known. Based on the algorithm (see below), choose the best match and assign the appropriate tracking id to the entity. Assign new tracking ids to entities that have not matched, and add them to the current set of known objects. Also assign a time of observation to the entity’s position.
Once entities are matched, check the set of known objects for objects that have been absent for too long. Drop those objects from the set of known objects.
Once complete, emit the current state from the activity (or return it from the procedure). The result has the set of tracked objects in the trackedObjects property, and the set of dropped objects in the droppedObjects property.
Parameters
Track Motion requires the following parameters.
- state – The current set of tracked objects. A null value indicates no current state.
- newObjects – The set of new objects with positions.
- algorithm – Algorithm to use to determine motion.
- qualifier – Value used to determine if two positions could be movement of the same entity.
- maxAbsent – An interval after which an entity is considered missing. Missing objects are dropped from the set of known objects.
- timeOfObservation – Time to assign to the observation. If unspecified, use the current time.
- coordinateProperty – The name of the property from which to get the coordinates. The default value is location. This can be used if the input stores location information under a different property name.
- labelProperty – The name of the property from which to extract the label. The default value is label. This can be used if the input labels things using a different property name.
- trackingIdSourceProperty – The name of a property that is known to contain a unique value. If present, the value is used as the tracking id. If absent (or if the value for an instance is missing), then the tracking id is generated.
Input data consists of things with locations, as described above.
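As a rough sketch, a VAIL procedure might maintain the tracking state and invoke the service as shown below. The positional parameter order (matching the list above), the algorithm value "centroid", the qualifier of 50 coordinate units, and the handling of maxAbsent are all assumptions for illustration; consult the MotionTracking service definition for the exact signature and accepted values.

PROCEDURE trackExample(trackingState, newObjects, maxAbsent)
    // Match the new observations against the currently tracked objects using
    // the centroid algorithm, allowing at most 50 coordinate units of movement
    // between observations (the qualifier). Objects unseen for longer than
    // maxAbsent are dropped.
    var result = MotionTracking.trackMotion(trackingState, newObjects, "centroid", 50, maxAbsent)
    // result.trackedObjects – the current set of tracked objects
    // result.droppedObjects – objects dropped because they were absent too long
    return result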
Algorithms
Motion tracking is performed using either of two (2) algorithms: centroid or bounding box. Both algorithms maintain the set of known objects and their last positions. As sets of new positions arrive, the algorithm compares the new positions to the old, determining which entities in the new set should be considered motion of the known entities. The remainder of this section describes that in more detail.
Both algorithms limit their comparisons to things with the same label. That is, if a car and a boat appear “near” one another in successive images, it is unreasonable to determine that the car moved to a boat.
Both algorithms include the notion of a qualifier. This further qualifies the comparison for purposes of determining movement. Each algorithm’s use of the qualifier will be noted below.
Centroid Algorithm
Compares the centers of the bounding boxes of the two positions. The pair with the smallest Euclidean distance (having identical labels) is considered motion.
The centroid algorithm uses the qualifier to specify the maximum distance an entity can travel and still be considered “motion.” If you are tracking cars approximately once per second, it is unreasonable to expect them to move 10 miles in a single second (under normal circumstances).
The centroid algorithm operates as follows:
- For each new entity, compare it to the set of known entities.
  - If the two entities have the same label, determine the Euclidean distance between the two entities.
- Find the closest object where the distance is not greater than the maximum distance.
  - Those two objects are, then, considered movement from old to new.
  - Give them the same tracking id.
- If there is no matching object, then consider this a new object for tracking purposes.
  - Assign it a new tracking id.
Bounding Box Algorithm
Compares the overlap of the bounding boxes for two entities with identical labels. The comparison resulting in the largest percentage overlap is considered motion.
The bounding box algorithm uses the qualifier to specify the minimum percentage overlap to be considered motion. So, if the qualifier is 0.50, that specifies that the bounding box from the new position must overlap by at least 50% with the old position to be considered motion.
The bounding box algorithm operates as follows:
- For each new entity, compare it to the set of known entities.
  - If the two entities have the same label, determine the percentage overlap of the bounding boxes for the old & new entities.
- Find the object with the highest overlap percentage (whose overlap percentage qualifies).
  - Those two objects are, then, considered movement from old to new.
  - Give them the same tracking id.
- If there is no matching object, then consider this a new object for tracking purposes.
  - Assign it a new tracking id.
In either case, the output of the track motion step is a set of tracked entities (trackedObjects), where each tracked object’s location contains the location information, the tracking id (trackingId), and the observation time (timeOfObservation).
If the trackingIdSourceProperty value is provided and that property contains a value, the value found will be used as the tracking id. If not, a tracking id will be generated. Where a unique value is known (for example, license tag numbers or facial recognition systems), the unique id can be used to track motion across areas that are disjoint.
Build and Predict Path
Takes the output from the track motion step and assembles paths for the tracked objects. For each tracked object:
- Find the path for that tracked object (determined by tracking id).
- Add the new position to the end.
- If the maximum path size is exceeded, remove elements from the start of the path.
- If no matching path is found, create a new tracked path.
At the end, return the set of known entities and their paths as well as a list of objects dropped and their predicted positions (if desired).
For objects for which there is no input (that is, track motion is no longer tracking the object), (optionally) return a predicted position. This will also remove the object from the list of actively tracked paths.
You can find a detailed output example in the Build And Predict Path section of the App Builder Reference Guide.
Parameters
- state – The current set of tracked paths. A null value indicates no current state.
- newObjects – The set of new objects with positions.
- maxSize – (optional) The maximum path length. Default value is 10.
- pathProperty – (optional) Property name to use to store the path within the location object. Default is trackedPath.
- coordinateProperty – (optional) The name of the property from which to get the location information. Default is location.
- doPredictions – (optional) Boolean indicating whether to predict the positions of objects dropped from tracking.
- timeOfPrediction – (optional) Time to assign to predicted positions. If unspecified, use the current time.
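Continuing the sketch from the Track Motion section, the tracked objects can be fed to the path-building step roughly as follows. Again, the positional parameter order (the path state followed by the new objects, matching the list above) is an assumption, and the optional parameters are omitted.

// pathState holds the previously returned set of tracked paths (null on the first call);
// trackResult is the output of MotionTracking.trackMotion() shown earlier.
pathState = MotionTracking.buildAndPredictPath(pathState, trackResult.trackedObjects)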
Predicting Locations
PredictPositions
Given a path, we can predict the next location, extrapolating from the last two known positions. To do so, we use MotionTracking.predictPositions(), which returns the list of paths with their predicted locations.
Parameters
- pathsToPredict – The current set of tracked paths for which to predict next positions.
- timeOfPrediction – (optional) Time to assign to predicted positions. If unspecified, use the current time.
- pathProperty – (optional) Property name to use to store the path within the location object. Default is trackedPath.
PredictPositionsBasedOnAge
We can also selectively predict positions based upon age (the time we last saw an object). To do so, we use MotionTracking.predictPositionsBasedOnAge(). This procedure evaluates the candidatePaths against the expirationTime. For any paths whose last timeOfObservation is at or before the expirationTime, we predict the next position (based on timeOfPrediction) and return that list. Paths whose last observation is after the expirationTime are ignored.
Parameters
- candidatePaths – The current set of tracked paths for which to predict next positions.
- expirationTime – The time representing the latest time considered expired.
- timeOfPrediction – (optional) Time to assign to predicted positions. If unspecified, use the current time.
- pathProperty – (optional) Property name to use to store the path within the location object. Default is trackedPath.
Tracking Regions
Tracking Regions provide the ability to name regions within the coordinate system used by applications in a namespace. A detailed reference can be found in the Tracking Regions section of the Resource Reference Guide. Tracking regions have the following properties:
- name – the name of the tracking region.
- boundary – an Object containing a list of (at most 4) points that comprise the boundary of the region.
- distance – an Object containing the following properties:
  - points – a list of two points
  - distance – the distance between these two points
- direction – an Object containing the following properties:
  - points – a list of two points
  - direction – the direction (compass degrees between 0 and 360) that movement from the first point to the second represents.
In each of the above, specify points with an x and y component (alternately, you can use lon and lat, respectively). Again, for details, please see the Tracking Regions section.
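A sketch of a tracking region definition follows. The property name points inside boundary is an assumption (see the Tracking Regions section of the Resource Reference Guide for the exact structure), and the coordinate, distance, and direction values are illustrative.

{
    name: "mainStreet",
    boundary: {
        // Up to four points outlining the region
        points: [ { x: 0, y: 0 }, { x: 100, y: 0 }, { x: 100, y: 40 }, { x: 0, y: 40 } ]
    },
    distance: {
        // The real-world distance (in the application's chosen unit) between these two points
        points: [ { x: 0, y: 0 }, { x: 100, y: 0 } ],
        distance: 150
    },
    direction: {
        // Movement from the first point to the second heads due east (90 degrees)
        points: [ { x: 0, y: 0 }, { x: 100, y: 0 } ],
        direction: 90
    }
}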
Tracking regions need not be mutually exclusive. That is, some particular location may be found in many different tracking regions (often shortened to regions) or none. For example, if we consider a traffic intersection, a single location could, quite reasonably, be contained in all of the following regions:
- The intersection
- The cross street
- The crosswalk
All regions in a namespace are expected to be in the same coordinate system. A coordinate system here refers to a consistent set of coordinates that provide the location information. See the Application Coordinate System and Multi-Camera Applications for further discussion.
By default, the set of regions in a namespace comprises the region search space for all applications in that namespace. That said, it is possible, within a single namespace, to have sets of tracking regions that the application considers disjoint. For example, consider a set of cameras that track the motion of objects in a set of completely separate buildings. Any given camera is known to belong to a particular building. In such an environment, it may not be desirable to create a coordinate system that maps all buildings separately within the coordinate space. Instead, we may consider a set of regions for each building or set of buildings, with these buildings or sets thought of as having overlapping coordinate systems. When choosing to do this, it is important to ensure that the sets in use are completely disjoint.
To do this, two things are necessary. First, each such set of regions must be named in such a way that the specific set can be determined. Second, each such set of regions must include the distance and direction properties if velocity is expected to be determined.
The region search space can be restricted through the use of the trackingRegionFilter property on the BuildAndPredictPath and PredictPathsByAge activity patterns when building apps. When using VAIL code to find regions, the list of regions to consider is passed to MotionTracking.findRegionsForCoordinate().
If an application is making use of the ability to use the same coordinate space for what it considers disjoint sets of tracking regions, the application is responsible for ensuring that these disjoint sets are always used consistently, and that the distance and direction properties are consistent and properly present in each such set where velocity information is expected.
An entity’s location is determined to be in a particular region if that entity’s bounding box’s centroid is located within or on the border of a region (i.e. it is not outside the region). Note that this means that the entire bounding box need not be contained within the region.
As an example, consider an image of a traffic intersection. To our application, such an image may have a number of named areas of interest. We can imagine (and roughly draw) regions defining
- the intersection (white),
- the main street (green), and
- the bike path (magenta).