WebSphere Process Server Performance

Tuesday, February 12, 2013

SIB queue monitoring

When things go wrong, they are sometimes not noticed early enough. Queue depth monitoring can help to detect problems before the system hits the wall, e.g. when a vital queue reaches its maximum queue depth.
A good approach is a script that is periodically kicked off by cron and records the depth of important queues in a file, which can then be fed into any kind of monitoring and visualization mechanism with some additional scripting.

This sample is a Jython script that you can call with

wsadmin -lang jython -f my-queue-depth-monitor.py

"wsadmin" is located in the /bin directory.
A sample my-queue-depth-monitor.py file could look like:

import time

# Names of the queue points to monitor - adjust these to your environment.
queuenames = [
    'BPEIntQueue_BPMP.AppTarget',
    'sca/mySCAModule',
    'BPEHldQueue_BPMP.AppTarget',
    'WBI.FailedEvent.BPMP.AppTarget',
    'HTMHldQueue_BPMP.AppTarget'
]

print "Start collecting queue depths - date/time:", time.ctime()
print

# Collect the MBean object names of all matching SIB queue points.
allqueues = []
for name in queuenames:
    query = 'WebSphere:*,name=' + name + ',type=SIBQueuePoint'
    allqueues = allqueues + AdminControl.queryNames(query).splitlines()

# Print one line per queue point: sequence number, identifier, depth.
i = 0
print "sequence number, queue name, current queue depth"
for queue in allqueues:
    i = i + 1
    identifier = AdminControl.getAttribute(queue, 'identifier')
    depth = AdminControl.getAttribute(queue, 'depth')
    print i, identifier, depth
print

To use this, you need to adjust the names of the queues you'd like to monitor. The easiest way is to open the admin console, go to "Service Integration" -> "Service Integration Bus Browser", and look up the correct names of the queue points of the various buses in the right pane.
(with credits to my script co-author Sascha P.)
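
The "additional scripting" mentioned above could be as small as the following Python sketch. It assumes the output format of the sample script (one data line per queue point: sequence number, identifier, depth). The function names, the sample data, and the threshold of 1000 are made up for illustration and are not part of any WebSphere tooling.

```python
def parse_depths(report_text):
    """Extract (queue name, depth) pairs from the monitor script's output."""
    rows = []
    for line in report_text.splitlines():
        parts = line.split()
        # Data lines look like "<sequence number> <queue name> <depth>";
        # headers, blank lines and other messages are skipped.
        if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
            rows.append((parts[1], int(parts[2])))
    return rows


def queues_over_threshold(rows, threshold):
    """Return the names of all queues whose depth has reached the threshold."""
    return [name for name, depth in rows if depth >= threshold]


# Hypothetical sample of what the cron job might have written to its file.
SAMPLE = """Start collecting queue depths - date/time: Tue Feb 12 10:00:00 2013

sequence number, queue name, current queue depth
1 BPEIntQueue_BPMP.AppTarget 12
2 sca/mySCAModule 1500
3 BPEHldQueue_BPMP.AppTarget 0
"""

if __name__ == "__main__":
    depths = parse_depths(SAMPLE)
    for name in queues_over_threshold(depths, threshold=1000):
        print("WARNING: queue", name, "is filling up")
```

A sensible threshold depends on the maximum queue depth configured for the queue in question, so it is better chosen per queue than globally.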

Friday, February 3, 2012

Do RUNSTATS at the right point in time

Updating the statistics about table and index cardinalities and the distribution of values is one of the BAU tasks of database maintenance: it gives the optimizer the information it needs to determine the best possible access path to the data. In WPS installations the usual recommendation is to do this once a week during off-peak hours, because runstats has to read all the data, and that is an impact you might want to avoid during peak hours.

However, there are workload patterns where such an approach is quite counter-productive. In a recent case runstats was done weekly on a Sunday night. The workload consisted primarily of microflows and some (short-lived) macroflows. The observed problem was processing slowdowns due to a large number of deadlocks (up to 10 per minute) during the peak times of the day.

Checking the explains for the deadlocked statements showed that the database's optimizer didn't pick the best possible index. There was an index that covered all the columns in the where-clause, but the optimizer used the primary key index instead.

A few SELECT COUNT statements on the involved tables showed a low and changing cardinality (just a few dozen rows). Together with the short lifetime of the macroflows and a daily processing pattern where almost everything happens during typical office hours, this suggested that performing runstats on a Sunday night had probably led to collecting statistics on empty tables.
And when the optimizer believes (due to the statistics it has) that the tables are empty, it might take an access path that involves e.g. more locking than necessary.

Now the suggestion was to perform runstats during a typical peak time, when the involved BPEDB tables contained a few dozen rows of data.

Tataaa - deadlocks were gone.

And the explain on the SQL statements in question showed that exactly the access path was taken that theoretically would be the best one.

Conclusion:
Sometimes runstats should be done during peak times - especially when the workload pattern is such that you could expect no or almost no process instance data being available at those off-peak hours where you typically would do database maintenance.
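
One way to find such a point in time is to sample the row counts of the affected tables over a typical day and to schedule runstats for the hour in which they are fullest. A minimal Python sketch, assuming the counts have already been collected (for example with hourly SELECT COUNT statements); the numbers and the function name are purely illustrative:

```python
def best_runstats_hour(rows_per_hour):
    """Return the hour of day (0-23) at which the sampled table held most rows."""
    return max(rows_per_hour, key=rows_per_hour.get)


# Hypothetical samples: hour of day -> rows counted in one BPEDB table.
SAMPLES = {2: 0, 6: 1, 10: 48, 14: 37, 18: 9, 22: 0}

if __name__ == "__main__":
    print("Schedule runstats around hour", best_runstats_hour(SAMPLES))
```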

Friday, July 15, 2011

Redpaper on Performance and Capacity Implications for a Smarter Planet

Instead of writing about "low level" performance tips, I tried to contribute something on a somewhat higher level this time.

This IBM Redpaper™ publication discusses the performance and capacity implications of "Smarter Planet" solutions. It examines the Smarter Planet vision, the characteristics of these solutions (including the challenges), examples of addressing performance and capacity in a number of recent Smarter IT projects, recommendations from what has been learned thus far, and discussions of what the future may hold for these solutions.

Table of contents:

Introduction to a Smarter Planet
Performance and capacity challenges introduced by Smarter Planet
Managing IT performance and capacity for the Smarter Planet
Five examples of Smarter IT projects and challenges
Recommendations
Conclusion

It can be found at http://www.redbooks.ibm.com/redpieces/abstracts/redp4762.html?Open

Friday, June 19, 2009

Performance boost through disabling file system caching

In a recent engagement to troubleshoot lousy performance, colleagues of mine were able to "fix it".

Usually it is a good thing, when an operating system (here: kernel and file system code) caches data in buffers at the file system level and flushes it out to disk when appropriate (and in the right chunks) to minimize the amount of physical I/O operations.

When placing database files on such filesystems however, the filesystem's caching algorithms can be extremely counterproductive.

In the case here, Solaris, the Veritas filesystem, and Oracle were involved. Re-mounting the filesystem with the parameter

mincache=direct

reduced the time for SQL inserts into the SIB tables from up to 11 seconds down to 0.03 seconds.

When dealing with the combination AIX, JFS2, and DB2, the DB2 command

db2 alter tablespace <tablespace-name> no file system caching

has more or less the same effect. Data caching on the filesystem level is disabled and the data is persisted as fast as possible. According to the AIX documentation the same effect can be reached on the file system level by using "mount ... -o dio". Read performance for non-DB files might suffer, because caching is reduced. DB-data will still be cached in the DB bufferpools.

Friday, May 8, 2009

SOA/BPM performance best practices (System and subsystem configuration)

This section handles some scalability related tuning knobs and some relevant tuning parameters for the involved subsystems like the JVMs and the databases.

Clustering topologies:
To accommodate growth and distribute workload, modern business process engines can run in a clustered setup that spreads the workload across several physical nodes (horizontal scaling) or makes better use of spare resources within existing nodes (vertical scaling).
For WPS, three different cluster topology patterns have been identified and described, e.g. in http://www-01.ibm.com/support/docview.wss?uid=swg27010320&aid=1 or http://www.ibm.com/developerworks/websphere/library/techarticles/0703_redlin/0703_redlin.html .

The first pattern (shown on the left) is also known as the “bronze topology”. It consists of a single application server cluster in which the WPS business applications, the support applications like CEI and BPC Explorer, and the messaging infrastructure hosting the messaging engines (MEs) that form the service integration buses (SIBus) all reside within each of the application servers of the cluster.
This bronze topology is suitable for a solution that consists only of synchronous web services and synchronous SCA invocations, preferably with short-running flows only.

The second pattern (shown in the middle) is also known as the “silver topology”. It has two clusters, the first one containing the WPS business applications and the support applications as before, but the messaging infrastructure is located in the second cluster.
This silver topology is suitable for a solution that uses long-running processes but needs neither CEI, nor message sequencing, nor asynchronous deferred response, nor JMS or MQ bindings.

The third pattern (shown on the right) is called the “golden topology”. Compared to the previous patterns, the support applications are separated into a third cluster.
This golden topology is suited for all the remaining cases, where asynchronous processing plays a nontrivial role in the solution. It also provides the most “JVM space” for the business process applications that should run in this environment. If the available hardware resources allow for setting up this golden topology, then it is advisable to start with this topology pattern from the very beginning as it is the most versatile one.
What is not shown in the above figure is the management infrastructure that controls the cluster(s). It consists of node agents and a deployment manager node as the central point of administration for the entire cell these clusters belong to. A tuning tip for this management infrastructure is to turn off automatic synchronization of the node configurations. Depending on the complexity of the setup, this synchronization is better kicked off manually during defined maintenance windows in off-peak times.

JVM Garbage Collection
Verbose garbage collection is not as verbose as the name suggests. The few lines of information that are produced when verboseGC is turned on don't really hurt the system's performance. On the other hand, they can be a very helpful source of information when troubleshooting performance problems.
The JVM used by WPS V6.1 supports several garbage collection strategies: the Throughput Garbage Collector (TGC), the Low Pause Garbage Collector (LPGC), and the Generational Garbage Collector (GGC).
The TGC provides the best out-of-box throughput for applications running on a JVM by minimizing costs of the garbage collector. However, it has “stop-the-world” phases that can take between 100ms and multiple seconds during garbage collection.
The LPGC provides garbage collection in parallel to the JVM’s work. Due to increased synchronization costs, throughput decreases. If response time is more important than highest possible throughput, this garbage collector could be a good choice.
The GGC is new in the IBM 1.5 JVM. It is well suited for applications that produce a lot of short-lived small Java objects. As it reduces pause times it should be tried in such cases instead of the TGC or LPGC. When properly tuned, it provides the best garbage collection performance for SOA/BPM workloads. [http://www.redbooks.ibm.com/abstracts/redp4431.html]

JVM memory considerations
Increasing the heap size of the application server's JVM can improve the throughput of business processes. However, it should be ensured that there is enough real memory available, so that the operating system does not start swapping. Detailed information on JVM parameter tuning can be found in [http://www.ibm.com/developerworks/java/jdk/diagnosis/].

Database subsystem tuning
To a large degree the performance of long-running flows and/or human tasks in a SOA/BPM solution depends on a properly tuned, enterprise-class database management system, in addition to the aforementioned application server tuning. This paper provides some tuning guidelines for IBM's DB2 database system as an example. Most of the rules should also be applicable to other production database management systems.
It is not advisable to use simple file based databases like Cloudscape or Derby as a database management system for WPS other than for the purpose of unit testing.

Configuration advisor
DB2 comes with a built-in configuration advisor. After creating the database, the advisor can be used to configure the database for the usage scenario expected. The input for the Configuration Advisor depends on the actual system environment, load assumptions, etc. Details on how to use this advisor can be found in [http://www-01.ibm.com/support/docview.wss?uid=swg27012639&aid=1]. Some parameter settings in the output of the advisor should be checked and adjusted afterwards.

MINCOMMIT: A value of ‘1’ is strongly recommended. The advisor sometimes suggests other values.
NUM_IOSERVERS: The value of NUM_IOSERVERS should match the number of physical disks (+2) the database resides on.
NUM_IOCLEANERS: Especially on multi-processor machines, enough IO cleaners should be available to make sure that dirty pages in the bufferpool are written to disk. Provide at least one IO cleaner per processor.
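
The three rules of thumb above can be turned into concrete values with a few lines of Python. This is only a sketch: the function names, the example machine (4 disks, 8 processors), and the database name BPEDB are assumptions chosen for illustration, and the generated command should be reviewed before it is run against a real database.

```python
def advisor_overrides(physical_disks, processors):
    """Apply the three rules of thumb from the text to a given machine."""
    return {
        "MINCOMMIT": 1,                       # recommended fixed value
        "NUM_IOSERVERS": physical_disks + 2,  # number of physical disks plus 2
        "NUM_IOCLEANERS": processors,         # at least one cleaner per processor
    }


def update_db_cfg_command(db_name, settings):
    """Build a DB2 command line that applies the given settings."""
    pairs = " ".join(f"{key} {value}" for key, value in settings.items())
    return f"db2 update db cfg for {db_name} using {pairs}"


if __name__ == "__main__":
    # Hypothetical machine: database on 4 physical disks, 8 processors.
    print(update_db_cfg_command("BPEDB", advisor_overrides(4, 8)))
```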

Database statistics
Optimal database performance requires the database optimizer to do its job well. The optimizer acts based on statistical data about the number of rows in a table, the use of space by a table or index, and other information. When the system is set up, these statistics are empty. As a consequence the optimizer usually makes sub-optimal decisions, leading to poor performance.
Therefore after initially putting load on your system, or whenever the data volume in the database changes significantly, you should update the statistics by running the RUNSTATS utility (DB2). Make sure there is sufficient data (> 2000 process instances) in the database before you run RUNSTATS. Avoid running RUNSTATS on an empty database as this will lead to bad performance.
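
A maintenance job could guard against the "empty database" case with a check like the following Python sketch. The threshold of 2000 is the one named above; the function name and the schema name MYSCHEMA are hypothetical, and the number of process instances is assumed to have been determined beforehand (e.g. with a SELECT COUNT on PROCESS_INSTANCE_B_T).

```python
MIN_PROCESS_INSTANCES = 2000  # threshold named in the text


def runstats_command(schema, table, process_instances):
    """Return a RUNSTATS command line, or None if there is too little data."""
    if process_instances <= MIN_PROCESS_INSTANCES:
        return None  # statistics on a (nearly) empty table would mislead the optimizer
    return (f"db2 runstats on table {schema}.{table} "
            "with distribution and detailed indexes all")


if __name__ == "__main__":
    for count in (150, 2500):
        command = runstats_command("MYSCHEMA", "PROCESS_INSTANCE_B_T", count)
        print(count, "instances:", command or "skip RUNSTATS for now")
```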

Enable Re-Optimization
If BPC API queries (as used e.g. by the BPC Explorer) are used regularly on your system, it is recommended to allow the database to re-optimize SQL queries once, as described at [http://www-01.ibm.com/support/docview.wss?rs=2307&uid=swg21299450]. This tuning step greatly improves the response times of BPC API queries. In lab tests the response time for one and the same query has been reduced from over 20 seconds down to 300 milliseconds. With improvements of such orders of magnitude the additional overhead for re-optimizing SQL queries should be affordable.

Database indexes
In most cases the BPM product's datastore does not come with every database index that might potentially be useful. In order to avoid unnecessary processing out of the box, it is much more likely that only those indexes have been defined that are necessary to run the most basic queries with an acceptable response time.
As a tuning step one can analyze the SQL statements resulting from end user queries to see how the query filters used by the end user (or in the related API call) relate to the WHERE clauses in the resulting SQL statements, and then define additional indexes on the related tables to improve the performance of these queries. After defining new indexes, the above-mentioned RUNSTATS action needs to be run to enable the use of the newly created indexes.
Sometimes customers are uncertain about whether they are turning their environment into an unsupported state when defining additional indexes. This is definitely not the case. Customers are even encouraged to apply such tuning steps and check whether they help. If not, they can be undone easily, e.g. by removing the index.

Further database tuning
Any decent database management system can keep its data in memory buffers called bufferpools to avoid physical I/O. Data that is in these bufferpools need not be read from disk when referred to; it can be taken from these memory buffers directly. Hence it makes a lot of sense to make these buffers large enough to hold as much data as possible.
The key metric to look at is the bufferpool hit ratio: the share of logical reads (data and index) that could be served from the bufferpool without a physical read. As a rule of thumb you can increase the size of the buffer pools as long as you get a corresponding increase of the bufferpool hit ratio. A well-tuned system can easily have a hit ratio well above 90%.
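
As a worked example, the hit ratio can be computed from the four read counters as in the following Python sketch. The function and parameter names are chosen for illustration; the numbers in the example are made up.

```python
def bufferpool_hit_ratio(data_logical, index_logical, data_physical, index_physical):
    """Return the percentage of logical reads that needed no physical read."""
    logical = data_logical + index_logical
    physical = data_physical + index_physical
    if logical == 0:
        return 0.0  # nothing has been read yet, so there is no meaningful ratio
    return 100.0 * (logical - physical) / logical


if __name__ == "__main__":
    # Hypothetical counters taken from a database snapshot.
    ratio = bufferpool_hit_ratio(60000, 40000, 3000, 1000)
    print("bufferpool hit ratio: %.1f%%" % ratio)
```

Comparing this value before and after a bufferpool size change shows whether the additional memory actually paid off.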
WPS accesses its databases in multiple concurrent threads and uses row level locking to ensure data consistency during its transactions. As a result, a lot of row locks can be active at times of heavy processing. The related database parameters for the space where the database maintains the lock information might have to be adjusted.
For DB2 the affected database configuration parameters are LOCKLIST and MAXLOCKS. Shortages in this lock maintenance space can lead to so-called lock escalations, where row locks are escalated to undesirable table locks, which can even lead to deadlock situations. Data integrity is still maintained in such situations, but the associated wait times can severely impact throughput and response times.

Monday, May 4, 2009

Poor Man's Flight Recorder

Problem statement:
Sometimes when watching WPS you're faced with questions similar to "what is it doing right now?". In this situation you'd wish the product had some production-level, minimally intrusive tracing capability that shows you what is going on at the programming model level, i.e. which process instances, BPEL activities, invokes, API calls, etc. are being executed "as we speak".
After some poking around in WPS' BPEDB I constructed an SQL statement that can at least serve as a simple surrogate for such a not-only-nice-to-have trace.

Solution: Poor Man's Flight Recorder:
Thanks to some very helpful timestamps in some of the tables, it was easy to construct an SQL statement that shows you what has been recorded in the BPEDB during the last 10 seconds (or just the last 2 seconds - change it as you like).

SELECT
ai.last_state_change as AI_last_state_change,
substr(atp.name,1,30) as AI_templatename,
case ai.state
when 0 then 'null(0)'
when 1 then 'inactive(1)'
when 2 then 'ready(2)'
when 3 then 'running(3)'
when 4 then 'skipped(4)'
when 5 then 'finished(5)'
when 6 then 'failed(6)'
when 7 then 'terminated(7)'
when 8 then 'claimed(8)'
when 9 then 'terminating(9)'
when 10 then 'failing(10)'
when 11 then 'waiting(11)'
when 12 then 'expired(12)'
when 13 then 'stopped(13)'
when 14 then 'processing_undo'
end as AI_state,
substr(pt.name,1,30) as PI_templatename,
case pi.state
when 0 then 'deleted(0)'
when 1 then 'ready(1)'
when 2 then 'running(2)'
when 3 then 'finished(3)'
when 4 then 'compensating(4)'
when 5 then 'failed(5)'
when 6 then 'terminated(6)'
when 7 then 'compensated(7)'
when 8 then 'terminating(8)'
when 9 then 'failing(9)'
when 10 then 'indoubt(10)'
when 11 then 'suspended(11)'
when 12 then 'compensation_fail'
end as PI_state,
ai.piid as Process_Instance_ID
FROM
activity_instance_b_t ai,
activity_template_b_t atp,
process_instance_b_t pi,
process_template_b_t pt
where
atp.atid = ai.atid and
ai.piid = pi.piid and
pi.ptid = pt.ptid and
ai.last_state_change > ((select max(last_state_change) from activity_instance_b_t) - 10 seconds)
order by
ai.last_state_change desc
fetch first 100 rows only
with ur ;

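This statement is for IBM's DB2 database - Oracle users might want to use "WHERE ROWNUM <= 100" (applied to a subselect that contains the ORDER BY, so that the rows are sorted before they are limited) instead of "fetch first 100 rows only".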

Sample output:
(if it looks distorted, you may want to enlarge your browser window horizontally to fit it in)

AI_LAST_STATE_CHANGE       AI_TEMPLATENAME     AI_STATE    PI_TEMPLATENAME PI_STATE    PROCESS_INSTANCE_ID
-------------------------- ------------------- ----------- --------------- ----------- -----------------------------------
2009-04-03-10.22.37.545000 Receive_waitforever waiting(11) tptmain         running(2)  x'900301206B806201FEFFFF808BEE1841'
2009-04-03-10.22.37.540000 Receive_waitforever waiting(11) tptmain         running(2)  x'900301206B8061BBFEFFFF808BEE1828'
2009-04-03-10.22.37.517000 Callsub             finished(5) tptmain         running(2)  x'900301206B8061BBFEFFFF808BEE1828'
2009-04-03-10.22.37.517000 Callsub             finished(5) tptmain         running(2)  x'900301206B806201FEFFFF808BEE1841'
2009-04-03-10.22.37.507000 Receive_waitforever waiting(11) tptmain         running(2)  x'900301206B805FCCFEFFFF808BEE17E4'
2009-04-03-10.22.37.477000 Callsub             finished(5) tptmain         running(2)  x'900301206B805FCCFEFFFF808BEE17E4'
2009-04-03-10.22.37.461000 Receive_waitforever waiting(11) tptmain         running(2)  x'900301206B805F5CFEFFFF808BEE17AB'
2009-04-03-10.22.37.451000 -                   finished(5) tptsub          finished(3) x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.451000 Reply               finished(5) tptsub          finished(3) x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.441000 Snippet2            finished(5) tptsub          finished(3) x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.439000 -                   finished(5) tptsub          finished(3) x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.439000 Reply               finished(5) tptsub          finished(3) x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.432000 Snippet2            finished(5) tptsub          finished(3) x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.400000 Callsub             finished(5) tptmain         running(2)  x'900301206B805F5CFEFFFF808BEE17AB'
......

Explanation:
The generated list has the following columns:

AI_LAST_STATE_CHANGE = activity instance last state change
The last point in time at which the activity's state changed. The table is sorted on this column in descending order, i.e. newest entry first.
AI_TEMPLATENAME = the template name of the activity
AI_STATE = the state of the activity instance
PI_TEMPLATENAME = the name of the process template of the process instance this activity is part of
PI_STATE = the state of the process instance
PROCESS_INSTANCE_ID = the ID of the process instance
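
Since the list interleaves the activities of all process instances, it can help to regroup it per instance. The following Python sketch does that for output saved in the six-column format described above. It is only an illustration: the function name is made up, and it assumes that template names contain no blanks, so that every data row splits into exactly six fields.

```python
from collections import defaultdict


def timeline_by_instance(output_lines):
    """Group activity state changes by process instance ID, oldest first."""
    timelines = defaultdict(list)
    for line in output_lines:
        fields = line.split()
        # Data rows have six whitespace-separated columns ending in a hex ID;
        # the header, the separator line and the trailing dots are skipped.
        if len(fields) != 6 or not fields[5].startswith("x'"):
            continue
        timestamp, activity, ai_state, _pt_name, _pi_state, piid = fields
        timelines[piid].append((timestamp, activity, ai_state))
    for events in timelines.values():
        events.sort()  # this timestamp format sorts chronologically as text
    return dict(timelines)


# Three rows taken from the sample output above.
SAMPLE_ROWS = [
    "2009-04-03-10.22.37.451000 - finished(5) tptsub finished(3) x'900301206B806316FEFFFF808BEE18B7'",
    "2009-04-03-10.22.37.451000 Reply finished(5) tptsub finished(3) x'900301206B806316FEFFFF808BEE18B7'",
    "2009-04-03-10.22.37.441000 Snippet2 finished(5) tptsub finished(3) x'900301206B806316FEFFFF808BEE18B7'",
]

if __name__ == "__main__":
    for piid, events in timeline_by_instance(SAMPLE_ROWS).items():
        print(piid)
        for timestamp, activity, state in events:
            print("   ", timestamp, activity, state)
```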

Possible performance impact:
The possible impact greatly depends on the size of the related tables. Most of the operation is being done on the ACTIVITY_INSTANCE_B_T table and there on the column LAST_STATE_CHANGE. By default, this column doesn't have an index. So if you experience very long execution times for the above SQL, you're most likely doing a table scan on the entire ACTIVITY_INSTANCE_B_T table. If so, you may want to define a suitable index for that column and perform a "runstats .... with distribution and detailed indexes all" on this table. This should reduce the table scan to an index scan, which should result in a noticeable reduction of the execution time of this SQL.

Disclaimer:
This SQL is light-years away from being a fully fledged flight recorder that shows all relevant events. On the other hand, it can at least provide some initial insight for problem determination purposes, and it is for sure far less intrusive (in terms of resource consumption) than a full-blown BPE trace.

(finally also published (slightly modified) at http://www-01.ibm.com/support/docview.wss?uid=swg21384848 )

Thursday, April 30, 2009

SOA/BPM performance best practices (Process Engine configuration)

Different process engines have different tuning options. Here some relevant options for WPS will be discussed.

Thread pool sizes
While long-running business processes spend most of their lifetime in the default thread pool (JMS based navigation) or the WorkManager thread pool (WorkManager based navigation), short-running processes don’t have a specific thread pool assigned to them. Depending on where the request to run a microflow comes from, a microflow runs within:

the ORB thread pool (e.g. the microflow is started from a different JVM with remote EJB invocation)
the Web container thread pool (e.g. the microflow is started using a http request)
the default thread pool (e.g. the microflow is started using a JMS message)

If microflow parallelism is not sufficient, examine your application and increase the respective thread pool. The key is to maximize the concurrency of the end-to-end application flow until the available processor or other resources are maximized.

Navigation mechanisms for long running flows
The business process engine in WPS processes long-running flows using a number of chained transactions. There are two types of process navigation techniques in WPS (since V6.1): JMS based navigation and WorkManager based navigation. Both types of navigation provide the same quality of service. The default is JMS based navigation. In lab tests, WorkManager based navigation has shown throughput improvements of up to 100%.

However, the behaviour of the system changes a bit. If JMS based navigation is used, there is no scheme (for example age-based or priority-based) to process older or more highly prioritized business process instances first. This makes it hard to predict the actual duration of single instances, especially on a heavily loaded system. With WorkManager based navigation, instances that are currently being processed continue to be processed as long as there is outstanding work for them. While this is quite efficient, it favours instances that are already running over those waiting to be picked up.

Resource dependencies
Start adjusting the parameters for the SCA and MDB activation specs and the WorkManager threads (if used) and then continue down the dependency chain as depicted below:

All those engine threads on the left side of the picture require JDBC connections to databases and it is very helpful for the throughput of the system if they don't have to wait for database connections from the related connection pools.
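
A quick plausibility check for this dependency is to compare the number of threads that may need a database connection with the maximum size of the connection pool they draw from. The Python sketch below assumes, as a simplification, that each thread holds at most one connection from that pool at a time; the thread counts and pool sizes are hypothetical and need to be replaced with the values from your own configuration.

```python
def connection_shortfall(engine_threads, max_connections):
    """Return how many threads could be left waiting for a JDBC connection."""
    demand = sum(engine_threads.values())  # worst case: all threads busy at once
    return max(0, demand - max_connections)


# Hypothetical maximum concurrency per thread source.
THREADS = {"SCA activation spec": 10, "MDB activation spec": 10, "WorkManager": 20}

if __name__ == "__main__":
    for pool_size in (30, 50):
        missing = connection_shortfall(THREADS, pool_size)
        print("pool size", pool_size, "-> threads that may have to wait:", missing)
```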

Other process engine tuning knobs
Business process engine environments have means to record what is happening inside, either for monitoring purposes or for problem determination. Any such recording requires a certain amount of processing capacity, so one should try to minimize monitoring recordings and definitely disable problem determination traces during normal operations.
Some of the data stores in out-of-the-box configurations may default to simple file-based single-user databases like Derby. This eases simple setups, since some database administrative tasks can be avoided. When performance and production-level transactional integrity are more important, it is advisable to place these data stores on production-level database systems. Throughput characteristics could improve by factors of 2 to 5.
When using WPS' common event infrastructure (CEI) for recording business-relevant events, it might help to disable the CEI data store within WPS, since CEI consumes these events and stores them in its own database. Also, validation of CEI events could be turned off once it has been verified that the emitted events are valid.