
Diagnosing Faults and Scalability Issues In Vantiq Applications

 by Paul Butterworth

May 26, 2019

Revision: 0.30

Introduction

Discovering and correcting failures and performance issues during the development of complex, high-performance systems is always challenging. This paper presents strategies and techniques for investigating and correcting such issues.

This document was motivated by issues that arose recently
during the development of two different VANTIQ systems. One of the applications
exhibited a system-wide error that appeared to be unreproducible. The second
application exhibited a stubborn scalability problem. These applications will
be used as illustrative examples in this paper.

Example Applications

Event Broker Application

This application of Pronto, the Advanced Event Broker,
accepts events from an ERP system as a published event. The event contains a request
for the execution of a specified service. A number of services are available
with a separate event (topic) used to identify the specific service being
requested. The service is invoked by sending it an event. The service responds
with a completion event. The payload associated with the completion event is
subsequently returned to the ERP system by invoking a REST operation supported
by the ERP system.

The overall flow of control in the Pronto application is depicted in the following diagram:

[Diagram: overall flow of control in the Pronto application]

Elevator Management

This application, in its first release, monitors up to
250,000 elevators. Each elevator reports status once a second by publishing to
a topic using the VANTIQ REST interface over HTTPS.

The application displays a table containing elevator status
for a selected subset of the elevators. The user may drill down to see the
detailed status of a specific elevator. Once a specific elevator is selected,
the user sees real-time (once a second) updates to the detailed status of the
selected elevator. The user may elect to drill down and see video from the
elevator if so desired.

The application also computes a number of statistics at one-minute intervals. The statistics include the time the elevator has been online, the count and duration of up and down movements, average temperature, and others. These per-minute statistics are further aggregated to present hourly, daily, weekly, monthly and yearly statistics for each elevator. Aggregate statistics for reporting purposes are retained for a year.

The status of the elevator is determined by observing whether the elevator is producing status reports every second. An elevator that produces no status reports for more than a one-minute interval is considered offline. If the elevator subsequently posts a valid status report, the elevator is returned to online status.
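
For illustration, here is a minimal Python sketch of the offline/online timeout logic just described. It assumes an in-memory map of last-report times and a hypothetical record_status_report hook; the actual application implements this detection with VANTIQ's missing activity pattern, as discussed in the performance model later in this document.

import time

OFFLINE_THRESHOLD_SECONDS = 60  # "more than a one-minute interval" from the description above

# Last time each elevator posted a valid status report, keyed by elevator id.
last_report = {}

def record_status_report(elevator_id):
    """Called whenever a valid status report arrives; returns the elevator to online."""
    last_report[elevator_id] = time.time()

def is_offline(elevator_id, now=None):
    """An elevator with no status report for more than one minute is considered offline."""
    now = time.time() if now is None else now
    last = last_report.get(elevator_id)
    return last is None or (now - last) > OFFLINE_THRESHOLD_SECONDS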

Diagnostic Strategies

There is a wide variety of diagnostic and debugging strategies. This document is deliberately opinionated, presenting a specific set of diagnostic strategies that the author believes are most effective, based on a long history of diagnosing problems in complicated systems.

Typically, the problems reveal themselves as system level
problems:

·
System errors are being produced that indicate the application is
not working correctly.

·
Some expected processing does not occur. The problem may exhibit consistent symptoms (the expected processing never occurs) or what look like random symptoms (sometimes the expected processing doesn't occur). In event-driven systems the problem is likely to be reported as “messages are getting lost”.

·
The system will not scale to support the specified load.

·
Request latencies are too long.

The overall diagnostic strategy is conceptually simple:

·
Build a simple model of the system.

·
Run the system

·
Inspect the error logs

·
Correct any system errors in the error log

·
Apply a known workload to the system

·
Correct any resource exhaustion errors

·
Correct any unacceptable response time issue

The steps in the overall diagnostic strategy should always
be executed in this order. The reasoning is that you cannot diagnose
performance problems if the system is not working correctly. Similarly, you
cannot diagnose unacceptable response times if the system has exhausted
available resources.

The diagnostic strategy relies heavily on the model of the system as the primary fault isolation technique: determine which component of the system is responsible for the problem being considered, then drill down into that component, applying these diagnostic techniques recursively, until the problem is isolated and corrective action is taken.

Models

Key to diagnosing any system problems is to have a simple
model of how the application works in terms of major components and how data
flows from component to component.

These are not models built with the
App Modeler, the App Builder or the Collaboration Builder. Those models focus
on requirements and application semantics. They do not focus on scalability nor
do they clearly identify the resources that have a major impact on performance
and scalability.

For fault diagnosis the model must be simple. A general rule of thumb is to keep the model to around six components or fewer. Models with more components quickly become too complicated to easily reason about. In addition, smaller components increase the probability that the problem will span multiple components, making the fault isolation problem more complicated.

The description of the Event Broker Application (above)
included an informal model that depicted the basic flow of the application. It
is also possible to build a more formal model using a Sequence Diagram that
shows the flow of execution for the major components of the application. If you
decide to formalize with a sequence diagram, make sure the only resources
represented in the diagram are those that have a major impact on scalability.
Adding more detail will only serve to confuse the issue when it comes time to
diagnose scalability problems. A sample sequence diagram for the Event Broker
Application is presented below:


[Sequence diagram for the Event Broker Application]

 

For scalability diagnosis the model is organized around the
primary use cases that must scale. Within this context models should be
constructed for each high traffic use case. The model for any use case should
be simplified to the point it models inbound and outbound events and any
database activities performed to support the use case. Additional details are
unnecessary at this point. If you elect to diagram the scalability model, it is
essential to include all resource consuming activities in the diagram:

·
inbound events

·
outbound events

·
database operations

·
invocations of external systems via SOURCEs

System Execution

With models in hand it is time to exercise the system. If you are still in the process of developing the system, exercise it using the structured sequence of steps that follows.

First run single requests through the system to check their
behavior. Run requests for each major use case the system should support. If
errors occur do not scale up the testing until the errors have been fixed.
Attempting to run a broken application at greater scale is a complete waste of
time.

Once the single request executions are solid and error-free, it is time to move on to more complex workloads. The next step in the process is to apply a workload that is a sequence of requests. This checks that the system does not have any behaviors that work one time but not on subsequent requests. Typically, these problems occur because some state being stored in the database is not fully understood and, therefore, is left in an inconsistent state that causes errors in the following requests. Problems at this level may also be caused by sources in which the behavior of the external system is not well understood and the requests from VANTIQ to the external system are not properly constructed to handle a sequence of requests.

At this point we can also take a first look at latency since
we can collect the time it takes to run the sequence of requests and determine
the average response time of each individual request. The numbers we obtain are
going to serve as our baseline because response times will only lengthen as we
place a greater load on the system.
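
A minimal sketch of how such a baseline might be collected is shown below, written in Python with the requests library; the endpoint URL, payload, and headers are placeholders to be replaced with whatever interface your application actually exposes.

import statistics
import time

import requests  # assumes the requests library is installed

# Hypothetical endpoint and payload; substitute your application's actual interface.
URL = "https://example.com/api/endpoint"
PAYLOAD = {"status": "ok"}
HEADERS = {"Content-Type": "application/json"}

def measure_baseline(n_requests=100):
    """Run a sequence of single requests and report the average response time."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        response = requests.post(URL, json=PAYLOAD, headers=HEADERS, timeout=30)
        response.raise_for_status()
        latencies.append(time.perf_counter() - start)
    print(f"requests: {n_requests}")
    print(f"mean latency: {statistics.mean(latencies) * 1000:.1f} ms")
    print(f"max latency:  {max(latencies) * 1000:.1f} ms")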

Once we have completed the simple executions and corrected
any faults detected, it is time to move on to larger, concurrent workloads. Do
not move on if errors are still being produced by the system or if the
latencies are already unacceptable. Larger loads are not going to increase your
ability to diagnose the issues the system is already exhibiting. See the
section below: Apply a Known Workload, for suggestions on how to load
test the application.

Inspect Error Logs

One thing we cannot emphasize enough: if the system is generating errors, the system is not working correctly, and ignoring the errors with the thought that you can deal with them later never works.

There are times when errors are expected but all expected
errors should be handled by the application. Any errors that are not handled are
considered faults and must be corrected.

Correct Errors

There is not much to say here. The errors must be corrected or handled by the system. There are no exceptions. Generally, errors produced internal to the system can be corrected with appropriate changes to the system's code. Errors produced as a by-product of communicating with external systems may not be correctable in the sense that they can be eliminated. Such errors must be accommodated by the system. For example, sending requests via a SOURCE may fail since there is always a possibility the network is down, the external system connected to the source is down, your credentials have expired, etc. In such cases the system must be able to handle the problem in some graceful fashion. The code to handle the error must be developed and included, along with documentation of any necessary recovery procedures.

Apply a Known Workload

Start increasing the load incrementally at this point. In
the previous test we ran a sequential stream of requests. However, we want to
apply the load in a controlled manner so that we don’t have to deal with
unknowns in both the load being applied and the behavior of the system; we wish
to only deal with unknowns in the behavior of the system we are testing.

The simplest way to provide a controlled load is to use test generators. However, the test generators MUST be well understood and the load they generate must be well understood. This is easy to say but hard to achieve. The first mistake made by developers new to scalability issues is the construction of test generators that either do not scale or do not apply the expected load. For example, if you build an application that submits 10 requests/sec and you actually need a load of 1,000 requests/sec, you might imagine running 100 copies of the application. That makes sense, but then these questions must be posed and answered (a sketch of a paced load generator follows the list):

·
Are all 100 copies of the generator running on a single machine?

·
What conflicts might we see that will cause the load produced to
be less than 1,000 requests/sec?

·
Not enough compute power?

·
Not enough memory?

·
Scheduling issues that cause the load to be less than evenly
distributed?

·
I/O issues or local networking issues that might limit available
throughput?
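
Below is a minimal sketch of a paced load generator, in Python with the requests library, that targets a fixed rate and reports the rate it actually achieved; this helps answer the questions above about whether the intended load is really being applied. The URL, payload, target rate, duration, and thread count are placeholder assumptions.

import threading
import time

import requests  # assumes the requests library is installed

URL = "https://example.com/ingest"   # placeholder endpoint
TARGET_RATE = 100                    # requests/sec each generator thread should produce
DURATION_SECONDS = 60
GENERATOR_THREADS = 4

def run_generator(results):
    """Send requests at a fixed pace and record the rate actually achieved."""
    interval = 1.0 / TARGET_RATE
    sent = 0
    start = time.perf_counter()
    next_send = start
    while time.perf_counter() - start < DURATION_SECONDS:
        try:
            requests.post(URL, json={"seq": sent}, timeout=10)
        except requests.RequestException:
            pass  # a real generator should count and report failures separately
        sent += 1
        next_send += interval
        sleep_for = next_send - time.perf_counter()
        if sleep_for > 0:
            time.sleep(sleep_for)
    results.append(sent / (time.perf_counter() - start))

achieved = []
threads = [threading.Thread(target=run_generator, args=(achieved,)) for _ in range(GENERATOR_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Comparing the achieved rate to the target exposes generators that cannot keep up.
print(f"aggregate achieved load: {sum(achieved):.0f} requests/sec")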

If you decide building test generators is too complicated, you might be attracted to existing test generation technology such as Gatling (which VANTIQ engineering uses for some tests). However, be forewarned that now you must understand the dynamic behavior of Gatling and the infrastructure on which you run it. All the questions above still must be answered, but now in the context of the test infrastructure. The benefit of these technologies is that, once you have invested in understanding them, that knowledge can be reused for future load testing.

Correct Resource Exhaustion Errors

As the applied workload increases you may encounter resource exhaustion errors. These errors are a little different from the ones discussed previously in that they do not represent logical errors in the application. Rather, they reflect the fact that the application is using more resources than it has been allocated.

Obviously, there are two possible corrections for this
problem:

1. Allocate more resources

2. Use fewer resources

The approach selected is very important when considering the
long-term health of the system because no matter what you think, resources are
finite. They are finite either because more resources are simply not available
or more resources will put the operational costs of the system over budget.
These considerations impact whether or not it is possible to increase
resources.

Making the system more efficient (aka using fewer resources)
is generally required if the estimated cost of operating the system is over
budget or the application bumps up against a hard limit in the available
resources.

Details will be discussed in the chapter Diagnosing Scalability Problems.

Correct Latency Issues

Once the system is capable of supporting the workload without exhausting any resources, the next task is to make sure latencies are acceptable. The system may not be exhausting resources, but it may still be using so many resources that latencies are far too long. If the application is responding in 10 seconds when it needs to respond in one second, there is much work to do.

Details will be discussed in the chapter Diagnosing Scalability Problems.

Summary

This chapter has presented an overview of the general
diagnostic process we recommend for system errors and scalability errors.
Detailed diagnostic strategies for system errors are discussed in the next
chapter followed by a chapter discussing detailed diagnostic strategies for
scalability issues.

Diagnosing System Errors

With a complex application, the first diagnostic issue that must be dealt with is system errors. System errors are found in the error log.
If you find errors in the log, your application is likely not working
correctly. Specifically, if you are seeing errors and those errors are not
being handled by the application, the application is NOT working correctly. It
is unlikely you can explain these away in a satisfactory fashion and they will
continue to plague all your diagnostic activities until you fix them!

An example of an error that IS expected and properly handled
is illustrated by the application VANTIQ uses to monitor the health of all
VANTIQ clouds. The monitor application expects to receive an event from each
VANTIQ cloud every few seconds. If the monitor application does not receive an
event from a cloud within 60 seconds, an error occurs. Specifically, the rule
times out waiting for an event. The monitor application catches the resulting
exception and marks the system as offline. The system error is expected as part
of the normal operation of the application and it is properly handled within
the application.

Another example of a system error that is properly handled is a request to a REST service that times out because the REST service being invoked is slow or unreliable. The proper way to handle such a failure is:

·
Retry the request because it may be a transient failure.

·
When retrying it is best to use some exponential back-off scheme to
reduce the number of retries attempted on a REST service that is going to be
offline for a longer period of time.

·
If the application needs a response, convert the error from a
system error into a user error and provide the user with instructions on how to
proceed.

If the error is not caught and the application simply
expects it to work every time, the application will not run correctly.
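
A minimal sketch of this retry-with-exponential-back-off pattern is shown below, in Python with the requests library as a stand-in for whatever REST mechanism is in use; in a VANTIQ application the equivalent logic would typically be written in VAIL around the SOURCE invocation.

import random
import time

import requests  # assumes the requests library is installed

def call_with_backoff(url, payload, max_attempts=5):
    """Retry a REST call with exponential back-off; re-raise after max_attempts failures."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(url, json=payload, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_attempts:
                # At this point convert the system error into a user error with
                # instructions on how to proceed, rather than letting it propagate unhandled.
                raise
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids synchronized retries
            delay *= 2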

This is a good example of an
intrinsic behavior exhibited by distributed systems. If an application
communicates with another application or service over a network, there is
guaranteed to be a time when a request to the remote application or service
will fail. This is not theoretical, there WILL be a failure. For an application
to be reliable, any errors produced by a remote request MUST be caught and
handled. Do not try to finesse your way out of this requirement. There WILL
eventually be a failure and you need to prepare for it.

Let’s look at a real-life example of diagnosing a fault in
our Event Broker example.

The problem was originally detected when 100 messages were presented to the platform by the ERP system. Unfortunately, only 90 of the messages successfully produced results. This problem was reported as “Pronto is losing messages”. Of course, “Pronto is losing messages” describes the externally observed behavior but does not provide any insight into the actual problem that is causing the messages to be lost. Problem reports of this class, in which the results appear to be random with some requests working and others not, are traditionally very difficult to diagnose because there is no example of an event that will ALWAYS fail.

This problem was diagnosed by creating a model of the system
that consisted of three top level components:

·
Accepting requests from SAP

·
Dispatching requests to the target services

·
Sending responses to SAP.

The diagnostic approach taken was to first isolate the
problem to one of the major components by testing each component in as isolated
an environment as possible.

1. Test the publishing of messages to the event broker from the ERP system. No messages were lost when running just this portion of the application.

2. Test the delivery of messages to the services and the publishing of responses back to the event broker. All messages were delivered and all responses were published.

3. Test the delivery of the responses to the ERP system. Not all responses were delivered as it was found that some REST requests to the ERP system would time out.

At this point the problem is isolated to a single component.
The developer’s task at this point is to drill down and determine the specific
condition that is causing the failure. There are several elements of the
request that might be checked:

·
Is the response too large and, therefore, the ERP system is
unable to process it in a reasonable amount of time?

·
Are the timeouts set to too small a value?

·
Is there something wrong with the VANTIQ REMOTE service or its
interaction with the ERP system?

·
Is the REST request improperly formatted? Usually this means some
header or content type is set incorrectly or is not set at all.

A number of these potential problems can be verified by constructing an alternate interface to the ERP system. For example, use cURL to present the same request with the same data to the ERP system. If the alternate interface works, the REST service request can be checked to make sure it mirrors the successful request. Other problems can be checked by inspection or by running a series of test cases, for example, applying different message sizes to see if there is an undocumented limit on the size of a message.
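
The cURL cross-check can equally be scripted; for example, the following minimal Python sketch (using the requests library) makes the headers and content type of a known-working request explicit so they can be compared with the request the VANTIQ REMOTE source constructs. The URL, headers, and payload are hypothetical.

import requests  # assumes the requests library is installed

# Hypothetical values; substitute the ERP endpoint, credentials, and payload under test.
ERP_URL = "https://erp.example.com/api/serviceResult"
HEADERS = {
    "Content-Type": "application/json",   # a missing or wrong content type is a common culprit
    "Authorization": "Bearer <token>",    # placeholder credential
}
payload = {"requestId": "12345", "result": {"status": "COMPLETE"}}

response = requests.post(ERP_URL, json=payload, headers=HEADERS, timeout=30)
print(response.status_code, response.elapsed.total_seconds())
print(response.text)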

In our ERP system example, the result of the detailed
analysis was that there was nothing wrong with the requests themselves or the
VANTIQ REMOTE service. The ERP system simply didn’t process the requests fast
enough or ignored them altogether. It was impossible to determine if the ERP
system was the problem since access to the ERP system to diagnose the problem
was not available.

The problem was resolved by catching failed REST requests to
the ERP system and re-trying them. Since the failures were random, a retry
would usually succeed.

Previously, we discussed the need
to inspect the error logs and eliminate any system errors that are not handled
by the application. In this example, the log contained the timeout errors
reported by the source. An inspection of the error log early in the diagnostic process
might have pointed directly at the problem and allowed the development team to
more quickly isolate the problem. Of course, the logs are, in general, noisy
since they may contain many log entries and the significance of individual log
entries can get lost in the noise inherent in a log. In such cases, the
diagnostic techniques presented above will usually solve the problem.

The main lesson of this chapter is that, in most cases, incorrect system-level behavior can be isolated to a problem in a specific component of the application. However, this is not obvious when viewing the external behavior of an application exhibiting what look like random failures. The trick is to break the application up into a few coarse-grained components and test each in turn, checking to see whether it is working correctly. When a component is added to the test that exhibits a problem, it is time to drill down, find out exactly what the issue is inside that component, and fix it.

Diagnosing Scalability Problems

Scalability problems may be seen in any application but, as
a rule of thumb, applications that process fewer than 100 events/second
generally don’t have scalability issues. Applications processing more than 100
events/second or expected to grow to more than 100 events/sec are likely to
exhibit scalability issues. Throughput and latency will necessarily be key
concerns.

Scalability problems are usually observed when the application is executed with a heavier workload applied. The first indication that there is a scalability issue with the application is when it exhibits one of the following behaviors:

·
Response times are lengthening substantially

·
Quotas are being exceeded

·
System errors are being generated (other than quota issues)

·
Random failures occur

Taken as a whole the observed behaviors make the problem
look overwhelming. The logs may contain thousands of errors and response times
rapidly increase from acceptable levels to the point where it looks like the
application is not even processing requests. However, some basic systems level
thinking can help you get a handle on the problem.

Build Performance Model

In order to get a handle on whether the application is
performing properly an essential first step is to develop a model of the
application that estimates the resources it uses on a per transaction basis and
the aggregate resources required to run the application at full scale. Building
such a model is an important exercise. If you can’t build such a model, then
you do not know how your application works and you are doomed to have
performance problems since you will be unable to reason about the resources
needed to handle any events presented to the application.

An easy way to build a performance model for an application (a minimal sketch in code follows this list):

·
Identify the application’s major use cases. Major use cases are
any that are run on a regular basis. Of most interest are use cases that apply
to every inbound event or, at least, a large number of inbound events. Use
cases that only run occasionally can be saved for a deeper analysis (if
required).

·
For each use case

o
Count the inbound events.

o
Trace the flow of processing for each use case identifying any
database activities that are executed on behalf of the use case. Count these by
type:

§
Select – one row

§
Select – many rows

§
Insert

§
Update

§
Upsert

§
Delete

o
Trace the flow of processing for each use case identifying
invocations of a machine learning model or calls to external systems. These are
both expensive operations that will have a significant impact on the
scalability of the application.

o
Count the outbound events
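
As a sketch of what such a model can look like when written down, the following Python fragment records per-use-case resource counts and sums the database operations across use cases; the field names are illustrative rather than a prescribed format.

from dataclasses import dataclass, field

@dataclass
class UseCaseModel:
    name: str
    inbound_events_per_sec: float = 0.0
    outbound_events_per_sec: float = 0.0
    # Database operations per second, counted by type,
    # e.g. {"select_one": 20, "upsert": 4167}.
    db_ops_per_sec: dict = field(default_factory=dict)
    # SOURCE invocations and machine learning model calls per second.
    external_calls_per_sec: float = 0.0

def total_db_ops(use_cases):
    """Aggregate database operations per second across all use cases, by type."""
    totals = {}
    for uc in use_cases:
        for op, rate in uc.db_ops_per_sec.items():
            totals[op] = totals.get(op, 0) + rate
    return totals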

Analyze the Workload Using the Performance Model

Once the modeling exercise is complete and we understand the
resources required to run each use case, we can use the workload specification
to compute the frequency of each use case and determine the overall load that
must be handled by the application. The workload is generally documented in
operations per second (usually written <ops>/sec). For scalability
analysis we will be interested in both the average workload and the peak
workload. The peak workload is important because it determines how much compute
power (resources) must be allocated to the application to satisfy its maximum
demand for resources.

As an example, let’s build a simple model of the elevator
application. As a reminder, the application supports a maximum of 250,000
elevators with each elevator reporting status once a second. The major use
cases are:

1. Each status report from each elevator is ingested and recorded.

2. A client displays a summary of elevator status for a selected set of elevators as a list.

a. The client can select an elevator and see the real-time detailed status for that elevator.

3. The online/offline state of each elevator and the length of time it has been in its current state is computed by making sure each elevator generates at least one event every minute. When the single event per minute is published, a counter is incremented and stored in the database indicating the elevator has been in its current state for another minute.

4. Elevator statistics are aggregated and stored once a minute. The specific aggregates are time going up, time going down, number of floors visited, and average temperature.

5. Clients can display statistics for an elevator on an hourly, daily, monthly and yearly basis.

The corresponding performance model for the elevator:

Inbound events: 250,000/sec

Use case 1: store the inbound
status report by upserting it into the database. This gives us 250,000
upserts/sec across all elevators.

Use case 2: assume the status display is updated once every 10 seconds for each client. 100 elevators are displayed by
each client. There are 200 clients. Assuming a uniform distribution this gives
us 20 selects/second with each select returning 100 elevators (objects).

Use case 3: Use a missing activity pattern to detect elevators that have not sent a status report in over a minute. Record the fact that this elevator is offline in the database. Use a limit activity pattern to determine the elevators that are online. On the minute mark, increment the online time for the elevator by performing an update on the type containing the current status for the elevator. This results in 4,167 updates/sec. We also need to observe that, while a simple calculation gives us 4,167 database operations/sec, in reality the limit activity is going to schedule this processing for all online elevators at the time the minute interval is reached. Thus, at full load, this is going to queue as many as 250,000 activities instantaneously, each of which will eventually make a database request.

Use case 4: Aggregate inbound data each second using VANTIQ real-time statistics. In addition, use a limit activity pattern to trigger one event per minute for each elevator. When the limit triggers an event, write the last minute's statistics to a log. Assuming a uniform distribution, this results in 4,167 upserts/sec. We also need to observe that, while a simple calculation gives us 4,167 database operations/sec, in reality the limit activity is going to schedule this processing for all online elevators at the time the minute interval is reached. Thus, at full load, this is going to queue as many as 250,000 activities instantaneously, each of which will eventually make a database request.

Use case 5: These are aggregate
queries over the last hour, day, week, month, year on the per minute
statistics. Let’s assume the average query is for a month of data which amounts
to 60 * 24 * 30.5 records that have to be aggregated. This works out to: 43,920
records per average query. Further assume that on average only one such
aggregate is being run concurrently.

The headlines for this application:

·
Events: 250,000/second

·
Upserts: 258,334/second

·
Queries: 11/second.

With this data in hand we can estimate the compute resources
required to support this workload.

·
250,000 events/second. Assume a single vCPU can process 1,000 events/sec, implying 250 vCPUs. Assuming 4-core machines, that gives us 63 4-vCPU machines to process the events.

·
258,334 upserts/second is not viable because we assume the DBMS
can do no more than 20,000 upserts/second. This is a hard limit so this
indicates the application WILL NOT SUPPORT ITS MAXIMUM WORKLOAD. There is no negotiation
possible. The app will have to use a different architecture.

·
11 queries/second is small compared to the rest of the workload;
therefore, we will ignore it.

·
The limit activity pattern that is triggered once a minute does
not actually distribute the events it produces uniformly over the minute
interval. Rather, when the minute timer goes off, all the work for the last
minute is scheduled. This means that up to 500,000 events are published almost
simultaneously. Since the system will not be able to process the events
instantly, the work will be queued. Unfortunately, the system will not queue
this much work without severely compromising memory. In order to avoid the
memory problems, the queues are limited in size and the attempt to queue this
much work will result in a large number of quota violations. This is another
scalability problem for this application that must be addressed.

This exercise has demonstrated one of the benefits of the
model. It took less than an hour to define the use cases and build the model.
It provides a fairly precise description of the application processing and it
made it clear the application as imagined simply will not support the specified
workload. No code has to be written; no stress test has to be executed. Instead,
in an hour we discovered we MUST develop a more scalable architecture.
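
To show how little machinery the analysis requires, the following sketch reproduces the headline arithmetic above using the UseCaseModel fragment from the performance-model section. The 1,000 events/sec per vCPU figure and the 20,000 writes/sec database limit are the assumptions already stated in the analysis, and updates and upserts are lumped together as database writes, matching the headline figure of 258,334/second.

import math

# Assumes UseCaseModel and total_db_ops from the performance-model sketch above.
elevator_use_cases = [
    UseCaseModel("ingest status reports", inbound_events_per_sec=250_000,
                 db_ops_per_sec={"upsert": 250_000}),
    UseCaseModel("status display", db_ops_per_sec={"select": 20}),
    UseCaseModel("online/offline tracking", db_ops_per_sec={"update": 4_167}),
    UseCaseModel("per-minute statistics", db_ops_per_sec={"upsert": 4_167}),
]

EVENTS_PER_VCPU = 1_000     # assumption stated above
VCPUS_PER_MACHINE = 4
DBMS_WRITE_LIMIT = 20_000   # assumed DBMS upsert limit from the analysis above

inbound = sum(uc.inbound_events_per_sec for uc in elevator_use_cases)
vcpus = math.ceil(inbound / EVENTS_PER_VCPU)
machines = math.ceil(vcpus / VCPUS_PER_MACHINE)

db = total_db_ops(elevator_use_cases)
writes = db.get("upsert", 0) + db.get("update", 0)

print(f"inbound events: {inbound:,.0f}/sec -> {vcpus} vCPUs -> {machines} 4-vCPU machines")
print(f"database writes: {writes:,.0f}/sec against an assumed limit of {DBMS_WRITE_LIMIT:,}/sec")
print("architecture is NOT viable" if writes > DBMS_WRITE_LIMIT else "within the assumed limit")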

Developing a Scalable Application Architecture

Now that we have demonstrated our initial application
architecture is not scalable, we must develop a more scalable architecture that
addresses the issues identified by analyzing the workload applied to the model.

We first tackle one of the most problematic parts of the
workload – the 250,000 upserts/second required to capture the real-time status
of all the elevators. In addition, the status is captured for the sole purpose
of displaying it to a maximum of 200 clients. Can we find an architecture that
is more efficient at populating the client displays than storing every status
report in the database?

There are a few alternative approaches discussed in the
following subsections.

Stream Status Directly to Clients

One approach is to stream the status directly from the
inbound event to the client. The most aggressive version of such an
architecture performs no database operations of any kind. Observe there are a
maximum of 200 clients and each client can display the real-time status of
exactly one elevator at any time. In addition, the elevator selection for any
client doesn’t change very fast. On average let’s call it once every 30
minutes. That means we are changing, on average, one elevator display
assignment every 9 seconds.

At runtime, the system has to route the status to any client
that is displaying the elevator for which a status report is being processed.
One way to do this is to have each client listen on a unique topic on which the
real-time status of the elevator the client is currently displaying is
published. To bind this client topic to the status events for the designated
elevator, we dynamically construct a rule that detects a status report for a
particular elevator and re-publishes the event to the unique client topic. Note
that these rules have to be constructed on the fly as each client selects an
elevator to view but, at the rates we are talking about, dynamically creating
these rules should not be that much of a burden.

An alternative to dynamically creating rules to bind clients
to elevators is to republish every arriving elevator status event on a unique
topic for each elevator. Of course, this implies we have 250,000 topics but
large numbers of topics are not a scalability issue. When a client selects an
elevator for which to display real-time status, the client subscribes to the topic
corresponding to that elevator to receive the real-time status.
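
A minimal sketch of the per-elevator topic alternative follows. The publish argument and the elevatorId field name are stand-ins; in VANTIQ this logic would be expressed as a rule that republishes to the derived topic, with each client subscribing to the topic for the elevator it is currently displaying.

def elevator_topic(elevator_id):
    """Derive a unique topic name per elevator, e.g. /elevators/status/1234."""
    return f"/elevators/status/{elevator_id}"

def on_status_report(event, publish):
    """Republish every inbound status report to its elevator's own topic.

    No database writes occur; a client displaying an elevator simply subscribes
    to that elevator's topic to receive its real-time status.
    """
    publish(elevator_topic(event["elevatorId"]), event)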

Use SELECTS to Reduce the Number of UPSERTS

Another alternative is to convert the upserts into selects.
In this case each client registers its interest in an elevator by adding an
object to a type. When a status report comes in, the rule that processes status
reports consults the database by doing a select and sends the status report to
any client that is observing that elevator. This converts the 250,000
updates/second to 250,000 selects/second (to check for observing clients).
Since selects are probably 10 times as efficient as upserts, it is far more
likely the system can support the select load more efficiently than the upsert
load. However, this approach feels like it is still dedicating far too many
resources to making status available for display on our 200 clients.

A variant on the above architecture that continues to store
the status in the database is to depend on statistical properties of the
elevator status reports to reduce the workload by converting the upserts into
selects and followed by fewer upserts. The statistical property this approach
depends on is that elevators report status often but actual elevator status
doesn’t change that often making many of the status reports redundant.
Elevators are idle a large percentage of the time and so their actual status is
not changing. Exploiting this semantic the system retrieves the current status
of each elevator when a status report arrives. If the existing status and the
new status are semantically identical, do not bother to do the update. Just a guess
at this point but I suspect the number of updates is likely to be closer to
5,000/second than 250,000/second using this approach. However, we still have to
deal with 250,000 selects/second. This is easier to manage than the same number
of upserts but this is still a large number of operations implying the
operational cost of the application will be higher than we would prefer.
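
The redundancy check might look something like the following sketch. The status field names are hypothetical, and where the text describes a select against the stored status, the sketch keeps an in-memory cache for brevity.

# Fields whose changes matter; other fields (for example a report timestamp) are
# ignored when deciding whether two status reports are "semantically identical".
SIGNIFICANT_FIELDS = ("state", "floor", "direction", "doorOpen")

last_status = {}   # most recent stored status per elevator id

def should_upsert(elevator_id, new_status):
    """Return True only when the new status differs from the last stored status."""
    previous = last_status.get(elevator_id)
    changed = previous is None or any(
        previous.get(f) != new_status.get(f) for f in SIGNIFICANT_FIELDS
    )
    if changed:
        last_status[elevator_id] = new_status
    return changed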

UPSERT Status Only for Elevators being Displayed by a Client

Next, we look at a scheme that reduces the upsert rate by
storing status reports only for elevators that are being viewed. This is
accomplished by dynamically constructing a rule when a client subscribes to
status updates for an elevator. The rule detects whether a status update is for the
elevator being displayed. If it is, it writes the status to the database. Thus,
the only status actually written to the database is for those elevators that
are bound to real-time displays. When the client switches to another elevator,
the rule is removed.

Reduce the Sampling Rate

A final approach is to simply reduce the rate at which
updates are recorded. For example, if we elect to only store every tenth status
report, the upsert rate is reduced by an order of magnitude to about 25,000
upserts/second – a rate that is still very large but may be possible.
Unfortunately, this may be practical but isn’t very satisfying because we were
required to change the requirements of the application and can only display
status collected every ten seconds. The client’s real-time display is less
real-time than originally specified.

Diagnosing Scalability Issues in Applications Under Development

The previous section illustrated
how building a model is a very efficient approach to evaluating application
architectures from a performance standpoint. However, this is not always possible for existing applications that are already exhibiting scalability issues. In this
section we discuss strategies for diagnosing problems in applications that are
under development or can be placed in development in order to diagnose
scalability issues. Production applications that cannot be changed or placed in
development are discussed in another (yet to be written) document.

As discussed earlier, an existing VANTIQ application with
scalability problems will exhibit one of the following behaviors:

·
Errors

·
Slowing response times (or no response at all)

Errors

The first thing to inspect is the error logs. Trying to
figure out what your application is doing or how it is performing is a waste of
time if it is consistently producing errors. The errors will mask any useful
information about the dynamic behavior of the application.

There are two classes of errors to consider:

·
User errors

·
Application errors

User errors are expected and are typically not a problem.
They are generated by users making mistakes in data entry or configuration
activities. For example, the system asks the user for a file and the user types
in a file name. Some reasonable percentage of the time the filename will be
typed incorrectly and a “file not found” error will be generated. Typically,
such errors are not a cause for concern as they are expected and handled by the
application.

In contrast, system errors are unexpected, and their
occurrence interferes with the proper behavior of the application. Some typical
errors include:

·
Timeouts. Anything that times out is likely to be a problem.
Timeouts occur for two reasons:

o
Something isn’t working. This normally applies to IO operations.
A classic example is invoking a REST service and the request times out before
it completes. This may be a networking problem, a problem in the remote service
or you may have asked it to do too much work and it cannot respond in time.
These problems may also be caused by poor or overloaded networks. The worst
case is when these errors are not very predictable, and the developer will have
to drill down further to find the root of the problem.

o
Too much work is being done. VANTIQ has a limit on the maximum
runtime of a procedure. That limit is 2 minutes. If you ask the system to do
something that takes more than two minutes you will receive an error. The correct
response to such an error is to break up the work so it can be done in smaller
chunks.

·
General execution errors. There can be lots of reasons for a wide
range of runtime errors:

o
Missing resource – calling a procedure that does not exist.
Calling a remote procedure that does not exist. Referencing a type that does
not exist. In VANTIQ this may be a remote resource that doesn’t exist so pay
attention to the error messages.

o
Faults in VAIL code. All the usual errors may occur such as
missing variables, invalid assignments, etc.

o
Missing events. These are insidious because the app may raise an
event that no one is listening to. Conversely, your app may be listening, but
no one is publishing. Many times these inconsistencies are caused by typos in
the topic names. Pronto can help diagnose these problems because both
publishers and subscribers are listed in the catalog so it is easy to see if
you are missing one half of the equation.

·
Not handling return values correctly. This is an insidious
problem since values are being returned but they are not the values you expect.
It is critical to read and understand the documentation on how return values
are produced because our asynchronous execution model does not always match
your intuition as to how the return value is produced.

·
Not handling query results in a streaming fashion. If you accumulate the results into an array, you are bound to have problems on general queries, which may return arbitrarily large result sets.

·
Authorization errors when the user doesn’t have enough rights to
access something.

·
Configuration errors particularly when talking to external systems.

·
Resource exhaustion and quota violations for resources allocated
to the application. These either mean you have a defective application that
consumes all available resources or the application is doing more work than the
resources allocated to it allow. Quota and resource errors are a key discussion
topic below.

Quota and Resource Consumption Errors

Any error that represents a quota and resource violation
MUST be corrected. An organization is allocated a specific quantity of
resources and cannot exceed those limits. The limits are established by quotas.
Limits include the number of rules that can be active at any instant in time,
the number of rules queued awaiting execution, the running time of procedures
(which is always set to a maximum of 2 minutes), and the maximum rate at which
events can be presented to the system. The result of a quota exceeded message
is that some parts of your application are being terminated. Therefore, the
application cannot be running correctly!

Evaluate whether the violation is because your app is out of
control and uses too many resources or your quotas are too low. This kind of
analysis generally requires some sort of resource usage analysis of your
application projected onto the load you are currently applying as previously
discussed in the modeling section.

If you determine that your application is using resources as expected and within the specified budget constraints, relief can be achieved by requesting larger quotas. A request that indicates what the quotas should be is always the simplest to deal with. A request without quantitative demands will probably result in an iterative process where the quotas are increased, the app is re-tested, and any remaining quota violations are dealt with by increasing quotas again.

Increasing quotas only works if
there are more resources available. In the elevator example where 250,000
updates/sec were required by the initial application architecture there is no
quota that can be assigned that will make the application work. The application
is simply demanding more resources than can be provided. Increasing quotas will
not solve this problem.

If the problem is that adequate compute resources are not
available, additional compute resources must be provided. In simplified terms
this implies increasing the compute power of the VANTIQ server cluster by
adding additional machines or increasing the size of the existing machines.

If the problem is that adequate database resources are not
available, it may be possible to increase the size of the servers running the
DBMS. However, you cannot increase capacity by adding additional DBMS servers.
Therefore, there is an upper bound on the amount of database resources that can
be supplied.

A couple of insidious cases include applications that queue
large amounts of work in a short period of time. The LIMIT activity pattern is
a prime example where it may queue a large amount of work when the limit
interval expires. If too much work is presented at once, a quota diagnostic will
indicate you have exceeded your quota and have queued too many events. Quotas
can be raised to some degree but, in many cases, a re-work of the application
architecture is required to smooth the delivery of requests to the system.

A related phenomenon is that high transaction
rate applications only work well if the load remains fairly constant. If you
are running 100,000 events/sec and periodically increase the load to 200,000
events/sec, the system may never completely stabilize after the spiking load is
presented. This leads to increased latencies and excessive resource usage. For
high transaction rate applications try to keep the load as steady as possible.
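
One way to smooth such a burst is to spread the once-a-minute work across the whole interval rather than releasing it all when the timer fires. The following is a minimal sketch assuming a generic deferred-execution hook; the actual mechanism would depend on how the application schedules its work.

MINUTE = 60.0

def schedule_minute_work(elevator_ids, schedule):
    """Spread per-minute work uniformly across the minute instead of queuing it all at once.

    schedule(delay_seconds, elevator_id) stands in for whatever deferred-execution
    mechanism the application uses. Each elevator keeps the same offset from minute
    to minute (within a process), so its statistics still cover a full one-minute window.
    """
    for elevator_id in elevator_ids:
        offset = (hash(elevator_id) % 600) / 10.0   # stable offset in [0, 60) seconds
        schedule(offset, elevator_id)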

Excessively Long Latencies

When a system is overloaded, response times degrade in an exponential fashion. Truly overloaded systems generally look like they are hung. Of course, in some cases if you wait long enough you will get a response, but that is a Pyrrhic victory as any normal user would consider the system hung and not responding.

 

The usual response to long latencies is to instrument the
application in detail, run the app recording all the instrumentation and then
inspect the recorded results to try to identify where the problem is occurring.
This approach usually confirms there is a problem, but it rarely makes the
location of the problem obvious. The most extreme example of excessive
diagnostic output is execution profiling. Profiling provides lots of detail
but, if you are looking for a system problem, it is probably too much detail. A
better approach is to go back to the simplified model of the application and
instrument each of the major components with elapsed time instrumentation. Then
run the application and identify which component is producing the excessive
runtime and, by implication, the long latencies. Once a component has been
identified, use a decomposition approach to further instrument the application
until you have identified a specific problem that can be attacked.
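
A minimal sketch of this kind of coarse, per-component elapsed-time instrumentation is shown below in Python; the component names are placeholders borrowed from the Event Broker model, and in a VANTIQ application the equivalent timing would be added to the code for each major component of the simplified model.

import time
from collections import defaultdict
from contextlib import contextmanager

elapsed_by_component = defaultdict(list)

@contextmanager
def timed(component):
    """Record elapsed time for one coarse-grained component of the model."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_by_component[component].append(time.perf_counter() - start)

# Usage: wrap each major component, run the workload, then compare the recorded
# totals to see which component accounts for the excessive latency before
# decomposing that component further.
#
# with timed("accept request"):
#     ...
# with timed("dispatch to service"):
#     ...
# with timed("return response to ERP"):
#     ...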

The problem is most likely an excessive use of resources in
the VANTIQ application or overloading an external system that has been
integrated with the VANTIQ application. External integrations are a key area to
focus on when excessively long latencies occur.

Grafana

VANTIQ provides instrumentation for observing resource
consumption within VANTIQ systems. This information is available on the Grafana
dashboards available in the VANTIQ IDE.

One of the advantages of the Grafana dashboards is that they
can be used to confirm the accuracy of your application model. There are
several dashboards available:

·
Rule execution

·
Procedure execution

·
Resource usage

·
Source activity

·
Type storage

For event-driven systems, the first dashboard to observe is
the rule execution dashboard. Rule execution displays the aggregate rule
execution rates as well as execution rates for each specific rule. If you have
developed an accurate model of the application and are applying a known load,
the execution rates for each rule and the system as a whole should match your
model. If the rates don’t match you have work to do to develop a more accurate
model!

First verify that all rules are executing. Specifically,
verify that the dropped and failed rates are zero. If these rates
are not zero, the rules are not executing to completion. This may be caused by:

·
errors (failed rule executions)

·
your application is exceeding its quota (dropped)

·
Another application executing within the same organization is exceeding its quota

As stated previously in this document, there is no point trying to scale the application if it is failing. Deal with these errors now!

Below is a dashboard screen shot
showing an execution interval in which there were at least two quota exceeded
errors at 13:39 and 13:40.

[Screenshot: Grafana Rule Execution dashboard showing quota exceeded errors at 13:39 and 13:40]

The errors are barely visible since Word has reduced the resolution of the screenshot. Magnifying the relevant area in the above screen image, you can see two small blue marks just before and at 13:40 in the image below. The blue marks represent quota exceeded execution failures.

The error has also been recorded
in the error log and is duplicated below:

Error in a resource of type ‘audits’ with id ‘/rules/catchTimer’
   Audit Alert: The execution credit for the organization Vantiq has been consumed. As a result the rule catchTimer is being throttled.

Execution Start Time: 2019-05-25 14:47:51.259

Execution End Time: 2019-05-25 14:47:51.259

Elapsed Time: 0ms

The audits failed during execution.

Code: io.vantiq.security.audit.alert

Message: Audit Alert: The execution credit for the organization Vantiq has been consumed. As a result the rule catchTimer is being throttled.

At this point it would be easy to
try to ignore this error. Our application is only running two events/sec and
cannot exceed its quota. Making such an assumption will simply waste your time.
The issue must be diagnosed. At this point we reviewed the code and assured
ourselves we are expecting to execute two rules/sec. We then reviewed the
dashboard over a longer term and confirmed two rules/second is our execution
rate. Finally, we observed that we are running in an organization that contains
a number of additional applications and it is one of those applications that is
exceeding the organization quota. Unfortunately, the fallout of that is that
our application is caught up in the failure. This is a VERY IMPORTANT lesson to
learn. If you are trying to measure scalability and resource usage accurately
you must be running in an isolated environment where other users are not
competing with you for a share of the resource quotas.

This is a key reason VANTIQ
recommends the SE and PS teams observe the following rules when building
applications:

·
Demos – demos do not run at scale and can be run in the VANTIQ
organizations such as:

o
VantiqEval

o
VantiqPOC

·
PoCs – do not run at scale; they are used to demonstrate
capabilities (not scale) to high value prospects. PoCs can be run in the VANTIQ
organizations such as:

o
VantiqEval

o
VantiqPOC

·
Pilots – pilots are applications being constructed for
prospects/customers and are usually sponsored and paid for by the customer.
Pilots generally have to demonstrate the scalability and, therefore, should be
developed in a separate organization dedicated to the pilot project. At some
point the pilot may be turned over to the customer and this becomes the
customer’s development organization.

·
Applications – applications are only built for customers and paid
for by customers. Applications must be developed and tested in an organization
dedicated to the customer.

If you ignore these rules and try to scale applications in
the VANTIQ namespace not only will you fail but you will damage the development
efforts of your colleagues.

The Rule Execution dashboard also indicates execution times
for the rules. The display includes p50 (the median execution time) and p99
(the 99th percentile which we usually interpret as the worst-case
response time). These values are important if they are larger than you expect
or larger than you can tolerate. For example, the dashboard example below shows
rule execution below 20 msec at the p50 level and 250-300 msec at the p99
level. The p99 number is always going to be more volatile as it represents worst-case behavior and is impacted by any delays seen in the processing of the rule. For example, a garbage collection could have a major impact. For the example application this is not a problem as we are receiving one event/sec and, therefore, have 1000 msec in which to execute each rule before the next event arrives. However, if the execution times are always above 1000 msec we have the problem that rule execution does not complete before the next event arrives. If this is a chronic condition, then new work arrives faster than work is completed, and the system will queue more and more work until it collapses.

[Screenshot: Grafana Rule Execution dashboard for an application running 2 rules/sec without errors]

The above image shows a Grafana Rule Execution dashboard for an application that is running 2 rules/sec at a steady state and without any errors.
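
If you are collecting your own execution-time samples outside of Grafana, the p50/p99 comparison against the inter-arrival budget can be reproduced with a small sketch like the one below; the nearest-rank percentile calculation and the 1,000 msec budget mirror the example above.

import math

def percentile(samples_ms, p):
    """Nearest-rank percentile of a list of execution times in milliseconds."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

def check_rule_latency(samples_ms, inter_arrival_ms=1000.0):
    """Compare p50/p99 execution times to the time available between events."""
    p50, p99 = percentile(samples_ms, 50), percentile(samples_ms, 99)
    print(f"p50={p50:.0f} ms  p99={p99:.0f} ms  inter-arrival budget={inter_arrival_ms:.0f} ms")
    if p50 > inter_arrival_ms:
        print("rules complete more slowly than events arrive; work will queue until the system collapses")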

Another dashboard that is quite valuable is the Resource
Usage dashboard. This dashboard shows select, insert, update, delete activity
on the resources (types) stored in persistent storage.

Warning: you can easily be led
astray by this display if you don’t pay attention to the “api” setting as the
dashboard displays resource usage either from the REST interface or from VAIL
applications. It is easy to think you are looking at rule activity (VAIL) when
you are actually displaying activity originating from the REST API (or vice
versa).

As observed earlier in this document, database resources
represent the ultimate limit on scalability within VANTIQ. Other resources can
be increased. If your application needs to execute a million rules/second more
compute power will solve the problem. However, the database has an upper limit
on scalability. This means the resource usage numbers are very important and,
in particular, that the aggregate resource throughput stays within a reasonable
range.

Our rule of thumb is to keep single
object SELECT activity below 20,000 requests/sec and single object updates
below 5,000/sec.

It is also important to pay attention to the operation
response times as excessive response times will have a similar effect here as
it does in rule execution where work starts arriving faster than it can be
processed.

[Screenshot: Grafana Resource Usage dashboard for the same application]

The above image shows a resource usage dashboard for the
same application running 2 rules/sec. Each rule is doing an update to a single
object in a single type. Therefore, the system is performing 2 updates/sec at
steady state.

Summary

We have discussed a number of approaches to diagnosing faults and scalability issues in VANTIQ systems. The discussion is not exhaustive; it presents only a representative sampling of the techniques available for diagnosing and correcting the discovered issues.

We hope you find this information helpful.

If you have other techniques you have used to attack these
problems please document them and we will add them to this guide.
