Oracle process ID is null in FND_CONCURRENT_REQUESTS table
Couple of days back, I had the strange issue in our PROD environment.
End user called up and said that certain print job (custom job) has been running long and they were not able to see the output coming from the printer
We are on 11.5.9 with database version 9.2.0.6
Issue analysis/findings and resolution
- What has been done to the PROD system
- There were some code migrations. Although these code migrations did touch some of the objects related to the offending program but it was very unlikely to cause this strange behaviour.
- Application tier/concurrent tier server was bounced due to some h/w change. Although this supposedly h/w change was later on reverted due to some other reason
- application patch was applied. This patch was exclusively related to Oracle workflow.
- Initial analysis showed that there were many duplicate requests submitted by the user. Hence those long running duplicate requests were cancelled.
- Despite cancelling some of the requests, many more requests kept coming up and thereby pegging the concurrent queue
- When tried to find what these long running requests are doing at the backend, I couldn’t find the Oracle process ID. ORACLE_PROCESS_ID column in FND_CONCURRENT_REQUESTS was NULL for all those concurrent requests. This was strange behaviour since this particular request has been completing successfully till sometime (may be around 10-12 hrs before) ago. And this is the reason, possibility of ‘SRW Exit and Init’ trigger code not being there was ruled out.
- Many other concurrent jobs (Oracle standard as well custom) in this particular manager were completing successfully.
- No errors were seen in the concurrent manager log – specific to the custom job which was giving issue and in question.
- No errors were seen in the startup script. All the components had come gracefully when the application was bounced.
- Later on I decided to bounce the concurrent manager. Even after bouncing the concurrent manager, behaviour remained the same i.e. when the concurrent managers came back, huge number of requests started showing running. Strange thing was none of “those” requests were having Oracle process ID associated.
- Ran cmclean.sql script – since the behaviour was related to concurrent processing. Logged a Sev1 SR with oracle and Oracle also confirmed to run cmclean.sql and then start the concurrent manager
- Despite running cmclean.sql, problem is not resolved.
- Then it was decided to bounce the whole application along with the database. But even database bounce didn’t help and still all “those” requests kept coming up – but without any associated ‘Oracle process ID’. This really made the thing worst – as I was not able to identify if there is anything happening on the database
- Database was perfectly fine –
- No unusual database waits
- No other long running jobs
- No database locks
- CPU utilization was normal
- No persistent latches
- Later on I tried submitting a request for the same concurrent job to see if there is some issue with the schedule. But even my job remained in the queue – due to many other requests already in queue.
- Oracle was also not much forthcoming – despite we being on Sev1 SR.
- I tried one last chance – before jumping into some unknown troubleshooting steps.
- shut down all the services on app and web tier using adstpall
- Tried checking if there is any old defunct process on app and web tier. I saw many (around 300-400) dead/defunct processes owned by applmgr. These sessions were print queue processes and many of them were ora_rw20_run. Ensured that no ‘applmgr’ process is left
- Killed all the defunct processes owned by applmgr
- bounced the database and listener
- Restarted the app and web services
- ‘Concurrent job’ in question started showing as completing successfully. Number of pending jobs also started receding.
- This resolved the issue and PROD started working as expected but the exact root cause was not known
What could have caused it – Dead/defunct processes on the application tier as a result of unclean bounce of server.
- We had server bounce due to some h/w maintenance and I remember AIX guy had some issue while unmounting the application filesystem and it took significantly more time to bring down the AIX box as compared to other instances.
- Many print queue processes were showing of the same timestamp – as of server bounce.
- Even relinking FND as suggested by Oracle through SR didn’t help to address the issue.
This issue haunted for several days before I happened to find the root cause and solution. Since it was never clear what has been causing this issue, so only evident solution that had worked everytime is bounce applicaiton services along with database.
Last time, when the issue came, I just happned to see the TOP command output and found VNC process taking around 6-7% CPU – which didnot look normal. I killed the VNC process and as soon as VNC process was killed, all the requests showing as running (phase code and status code = R) went away. Concurrent Manager queue which was pegged started coming to normal.
All these concurrent jobs which was running long without any databas handle process were POSTSCRIPT report and hence need some sort of DISPLAY (VNC in our case).
Currently working to figure out if upgrading VNC would resolve the frequent problem. I will update the post once I find the permamnent solution (not bouncing VNC process).
-Anand M
Hi All,
Thank you so much for your article.
I was facing the same issues and the same steps you described.
Just wonder have you found out the root causes? If so, would you please share with me?
Thanks
Amy
Thanks Amy that it helped your but unfortunately, I could not find the cause.