Monday, May 4, 2009

Poor Man's Flight Recorder

Problem statement:
Sometimes when watching WPS you're faced with questions similar to "what is it doing rightnow?". In this situation you'd wish the product had some production level, low intrusive tracing capability, that shows you, what is going on on the programming model level. Which would e.g. be what process instances, BPEL activities, invokes, API calls, etc. are being executed "as we speak".
After some poking around in WPS' BPEDB I constructed an SQLs statement, that at least can serve as a simple surrogate for such a not-only-nice-to-have trace.

Solution: Poor Man's Flight Recorder:
Due to some very helpful timestamps in some of the tables, it made it easy to construct an SQL statement, that can show you what has been recorded in the BPEDB for the last 10 seconds (or just the last 2 seconds - change it as you like).
SELECT
ai.last_state_change as AI_last_state_change,
substr(atp.name,1,30) as AI_templatename,
case ai.state
when 0 then 'null(0)'
when 1 then 'inactive(1)'
when 2 then 'ready(2)'
when 3 then 'running(3)'
when 4 then 'skipped(4)'
when 5 then 'finished(5)'
when 6 then 'failed(6)'
when 7 then 'terminated(7)'
when 8 then 'claimed(8)'
when 9 then 'terminating(9)'
when 10 then 'failing(10)'
when 11 then 'waiting(11)'
when 12 then 'expired(12)'
when 13 then 'stopped(13)'
when 14 then 'processing_undo'
end as AI_state,
substr(pt.name,1,30) as PI_templatename,
case pi.state
when 0 then 'deleted(0)'
when 1 then 'ready(1)'
when 2 then 'running(2)'
when 3 then 'finished(3)'
when 4 then 'compensating(4)'
when 5 then 'failed(5)'
when 6 then 'terminated(6)'
when 7 then 'compensated(7)'
when 8 then 'terminating(8)'
when 9 then 'failing(9)'
when 10 then 'indoubt(10)'
when 11 then 'suspended(11)'
when 12 then 'compensation_fail'
end as PI_state,
ai.piid as Process_Instance_ID
FROM
activity_instance_b_t ai,
activity_template_b_t atp,
process_instance_b_t pi,
process_template_b_t pt
where
atp.atid = ai.atid and
ai.piid = pi.piid and
pi.ptid = pt.ptid and
ai.last_state_change > ((select max(last_state_change) from activity_instance_b_t) - 10 seconds)
order by
ai.last_state_change desc
fetch first 100 rows only
with ur ;
This statement is for IBM's DB2 database - Oracle users might want to use "WHERE ROWNUM < 100" instead of "fetch first 100 rows only".

Sample output:
(if it looks distorted, you may want to enlarge your browser window horizontally to fit it in)
AI_LAST_STATE_CHANGEAI_TEMPLATENAMEAI_STATEPI_TEMPLATENAMEPI_STATEPROCESS_INSTANCE_ID
-------------------------------------------------------------------------------------------------------------------
2009-04-03-10.22.37.545000Receive_waitforeverwaiting(11)tptmainrunning(2)x'900301206B806201FEFFFF808BEE1841'
2009-04-03-10.22.37.540000Receive_waitforeverwaiting(11)tptmainrunning(2)x'900301206B8061BBFEFFFF808BEE1828'
2009-04-03-10.22.37.517000Callsubfinished(5)tptmainrunning(2)x'900301206B8061BBFEFFFF808BEE1828'
2009-04-03-10.22.37.517000Callsubfinished(5)tptmainrunning(2)x'900301206B806201FEFFFF808BEE1841'
2009-04-03-10.22.37.507000Receive_waitforeverwaiting(11)tptmainrunning(2)x'900301206B805FCCFEFFFF808BEE17E4'
2009-04-03-10.22.37.477000Callsubfinished(5)tptmainrunning(2)x'900301206B805FCCFEFFFF808BEE17E4'
2009-04-03-10.22.37.461000Receive_waitforeverwaiting(11)tptmainrunning(2)x'900301206B805F5CFEFFFF808BEE17AB'
2009-04-03-10.22.37.451000-finished(5)tptsubfinished(3)x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.451000Replyfinished(5)tptsubfinished(3)x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.441000Snippet2finished(5)tptsubfinished(3)x'900301206B806316FEFFFF808BEE18B7'
2009-04-03-10.22.37.439000-finished(5)tptsubfinished(3)x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.439000Replyfinished(5)tptsubfinished(3)x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.432000Snippet2finished(5)tptsubfinished(3)x'900301206B806318FEFFFF808BEE18B8'
2009-04-03-10.22.37.400000Callsubfinished(5)tptmainrunning(2)x'900301206B805F5CFEFFFF808BEE17AB'
......
Explanation:
The generated list has the following columns:
  • AI_LAST_STATE_CHANGE = activity instance last state change
    The last point in time, the activitiy's state got changed - the table is sorted on this column, descending - i.e. newest entry first.
  • AI_TEMPLATENAME = the template name of the activity
  • AI_STATE = the state of the activity instance
  • PI_TEMPLATENAME = the name of the process template of the processinstance, this activity is part of
  • PI_STATE = the state of the process instance and finally
  • PROCESS_INSTANCE_ID = the ID of the process instance.
Possible performance impact:
The possible impact greatly depends on the size of the related tables. Most of the operation is being done on the ACTIVITY_INSTANCE_B_T table and there on the column LAST_STATE_CHANGE. By default, this column doesn't have an index. So if you experience very long execution times for the above SQL, you're most likely doing a table scan on the entire ACTIVITY_INSTANCE_B_T table. If so, you may want to define a suitable index for that column and perform a "runstats .... with distribution and detailed indexes all" on this table. This should reduce the table scan to an index scan, which should result in a noticeable reduction of the execution time of this SQL.

Disclaimer:
This SQL is lightyears away from being a fully fledged flight recorder, that shows all relevant events. On the other side it can at least provide some initial insight for problem determination purposes and it is for sure far less intrusive (in terms of resource consumption) than a full blown BPE trace.

(finally also published (slightly modified) at http://www-01.ibm.com/support/docview.wss?uid=swg21384848 )

1 comment:

Anonymous said...

Is there anything preventing me from doing a dirty read?