Production support covers the practices and disciplines of supporting the IT systems and applications currently in use by end users. A production support person or team is responsible for monitoring production servers and scheduled jobs, managing incidents, and receiving incidents and requests from end users. They analyze each one and either respond to the end user with a solution or escalate it to other IT teams, which may include developers, system engineers, and database administrators.
The importance of production support
In order to understand the importance of production support, one needs to take a few factors into account.
- Studies have found that maintenance can account for more than 90% of the total cost of software.
- Software spends much more time in production than in development, and its behavior has to be verified throughout that time.
- Hardware carries its own ongoing maintenance cost.
Taken together, these factors show that the way production support is managed is crucial.
Production Support Steps
The major steps of production support are listed below. They are described in the context of batch processing.
Recording Production Error
Usually a batch job, or a group of related batch jobs (a schedule or stream), runs to accomplish one or more business functions. These batch jobs run unattended and normally complete without errors. Occasionally, however, a batch job is interrupted and terminates abnormally (an abend or abort). A job can abend for several reasons.
When a job abends, it can send an automated alert by e-mail, page, or text message. The data center or operations team also actively monitors the jobs and can send alerts through the same channels, or call the on-call person responsible for recovering the abended job.
The on-call person acknowledges the e-mail, page, text message, or phone call for the abended job and records the job details in a production issue tracking system. In some setups the abended job records its own details, along with the job standard list (job log), in the tracking system automatically. The details of the abended job (job standard list, error log files, and so on) are available in the production job scheduler tool. The production issue tracking tool creates a request number, which is given to the support team and used to track the progress of the issue. The request is assigned to the on-call support person.
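The recording step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SupportRequest` fields, the `record_abend` function, and the in-memory request counter are hypothetical stand-ins for whatever a real issue tracking system provides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical in-memory counter standing in for the tracking tool's
# request-number generator.
_request_numbers = count(1)


@dataclass
class SupportRequest:
    request_number: int
    job_name: str
    schedule: str
    return_code: int
    job_log: str          # the job standard list captured from the scheduler
    assigned_to: str      # the on-call support person
    status: str = "open"
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def record_abend(job_name, schedule, return_code, job_log, on_call):
    """Create a tracking record for an abended job and assign it to on-call."""
    return SupportRequest(
        request_number=next(_request_numbers),
        job_name=job_name,
        schedule=schedule,
        return_code=return_code,
        job_log=job_log,
        assigned_to=on_call,
    )


# Example with made-up job details.
request = record_abend(
    job_name="LOAD_ORDERS",
    schedule="NIGHTLY",
    return_code=12,
    job_log="step 3 failed: input file missing",
    on_call="alice",
)
print(request.request_number, request.status, request.assigned_to)
```

The request number returned here is what the later steps would use to attach investigation notes and track progress.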
Notification of Production Error
For critical production errors (for example, when the production job is on the critical path and likely to delay the batch completion SLAs, and when the error is affecting business data), an e-mail is sent to the entire organization or to the impacted teams so that they are aware of the issue. The e-mail also gives the estimated time to recover from the error.
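The notification decision could be expressed roughly as follows. The sketch assumes that either condition on its own (critical path or business data impact) is enough to make an error critical; the text does not settle that point, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProductionError:
    job_name: str
    on_critical_path: bool        # likely to delay batch completion SLAs
    impacts_business_data: bool
    estimated_recovery_minutes: int


def is_critical(error):
    # Assumption: either condition alone triggers a broad notification.
    return error.on_critical_path or error.impacts_business_data


def build_notification(error, request_number):
    """Return the e-mail text for a critical error, or None if no broadcast is needed."""
    if not is_critical(error):
        return None
    return (
        f"[Request {request_number}] Production job {error.job_name} failed. "
        f"Estimated recovery time: {error.estimated_recovery_minutes} minutes."
    )


critical = ProductionError("LOAD_ORDERS", True, False, 45)
minor = ProductionError("PURGE_TEMP", False, False, 10)
print(build_notification(critical, 101))
print(build_notification(minor, 102))
```

Actually sending the message is left out on purpose, since the delivery channel (mail server, paging service) depends on the organization.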
Investigation or Analysis of Production Error
The on-call person from the production support team collects all the necessary information about the production error and records it in the production error tracking tool under the support request number previously assigned. All details of what failed, such as the data, environment, process, and program logic, are used in the investigation. The production batch job, the program, and any tool or utility involved are reviewed for possible errors.
Resolution of Production Error
If a similar production error has occurred in the past, the resolution steps are retrieved from the support knowledge base and the error is resolved using those steps. If it is a new production error, new resolution steps are worked out, the error is resolved, and the new steps are recorded in the knowledge base for future use. For major production errors (critical infrastructure or application failures), a conference call is initiated, and all required support persons and teams join the call and work together to resolve the error. This is called incident management. If a problem occurs repeatedly, it is recorded and tracked using appropriate tools and processes until it is resolved permanently. This is called problem management. The issue is closed only after the customer or end user agrees that the problem is resolved.
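The knowledge-base lookup described above amounts to a "look up, else create and store" pattern. The following minimal sketch assumes that errors can be reduced to a string signature and that the knowledge base is a plain dictionary; both are simplifications, and every name is hypothetical.

```python
def resolution_steps(signature, knowledge_base, work_out_steps):
    """Return resolution steps for an error, reusing known steps when they exist.

    knowledge_base maps an error signature to a list of steps.
    work_out_steps is called only for errors that have not been seen before.
    """
    steps = knowledge_base.get(signature)
    if steps is None:
        # New error: derive the steps and record them for future use.
        steps = work_out_steps(signature)
        knowledge_base[signature] = steps
    return steps


# Example with made-up data.
kb = {"LOAD_ORDERS:missing-input": ["re-request input file", "restart from step 3"]}


def investigate(signature):
    # Placeholder for the manual analysis done by the support person.
    return [f"analyze {signature}", "apply fix", "rerun job"]


known = resolution_steps("LOAD_ORDERS:missing-input", kb, investigate)
new = resolution_steps("LOAD_ITEMS:disk-full", kb, investigate)
print(known)
print(new)
print(sorted(kb))
```

A second occurrence of `LOAD_ITEMS:disk-full` would now be served from the knowledge base without a new investigation.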
Production job/program code correction
If the production error was caused by a programming error, a request is created for the development team to correct it. The problem is identified and defined, and a root cause analysis is performed. The programming error is then fixed through the normal SDLC process: analysis, design, programming, QA and testing, and release. The new version of the production job or program is deployed and then verified and validated.
Production Process correction
If the production error was caused by job or schedule dependency issues, or by sequencing issues, further analysis is done to find the correct sequence and dependencies. The new sequence and dependencies are verified and validated in a test environment before they are deployed to production.
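One way to check a proposed set of job dependencies before deployment is a topological sort, which either produces a valid run order or reports a circular dependency. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the job names are made up, and a real scheduler would hold this information in its own format.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return one valid run order for the jobs.

    dependencies maps each job to the set of jobs that must finish first.
    Raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical nightly schedule: each job depends on the previous one.
dependencies = {
    "EXTRACT": set(),
    "TRANSFORM": {"EXTRACT"},
    "LOAD": {"TRANSFORM"},
    "REPORT": {"LOAD"},
}
order = run_order(dependencies)
print(order)

# A circular dependency is rejected instead of producing an order.
try:
    run_order({"A": {"B"}, "B": {"A"}})
except CycleError as exc:
    print("invalid schedule:", exc.args[0])
```

Running such a check in the test environment catches sequencing mistakes before the corrected schedule reaches production.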
Infrastructure Issue correction
If the production error was caused by infrastructure issues, the responsible infrastructure team is notified. That team then implements a permanent fix and monitors the infrastructure to prevent the same error from recurring.
Production Support Billing
If the production error resulted from unexpected consequences of infrastructure changes, the infrastructure team is often unable to bill the time spent resolving the issue at the full rate. In some cases the hours cannot be billed at all.
Production Support - Follow up and Reporting
The production error tracking system is used to review all issues periodically (daily, weekly, and monthly), and reports are generated to monitor resolved, repeating, and pending issues. Reports are also generated for IT/IS management to support the improvement and management of production jobs.
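A periodic report of this kind can be reduced to a few counts over the tracked issues. The sketch below assumes each issue is a dictionary with `job_name` and `status` keys and treats any job with more than one issue in the period as repeating; the record layout and that threshold are assumptions, not features of a particular tracking tool.

```python
from collections import Counter


def summarize(issues):
    """Summarize tracked issues into resolved, pending, and repeating counts."""
    by_status = Counter(issue["status"] for issue in issues)
    by_job = Counter(issue["job_name"] for issue in issues)
    return {
        "resolved": by_status["resolved"],
        "pending": by_status["pending"],
        # Assumption: a job with more than one issue counts as repeating.
        "repeating_jobs": sorted(job for job, n in by_job.items() if n > 1),
    }


# Made-up issues for one reporting period.
issues = [
    {"job_name": "LOAD_ORDERS", "status": "resolved"},
    {"job_name": "LOAD_ORDERS", "status": "pending"},
    {"job_name": "PURGE_TEMP", "status": "resolved"},
]
report = summarize(issues)
print(report)
```

The same function can be run over daily, weekly, or monthly slices of the tracking data, depending on how the issues are filtered beforehand.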