
Re: ITCS status tool feedback

This is the feedback blackops worked on together. To this I would add that the 24-hour limitation should be rolled back immediately, or a category for non-emergencies under 24 hours, e.g. 'routine non-user-visible environment changes', should be added.


	From: 	  gelle@xxxxxxxxx
	Subject: 	outage/status tool feedback
	Date: 	June 2, 2005 3:25:40 PM EDT
	To: 	  blackops@xxxxxxxxx
	Cc: 	  akbrooks@xxxxxxxxx

We worked on these as a group. Not sure who you want them sent to, or if we should next get input from other UMCE groups.


1. Staff-only notes should be mailable. The staff groups that receive outages with notes should be selectable separately from the groups that receive the outage notice. Groups should include 'blackops', 'gpcc', 'swat.team' (web), 'wats', 'sites.tsg', etc. There are probably others. Currently we have to send additional explanatory messages to our groups after reporting outages. One of the features we were promised was one-stop reporting, but currently there is no way to make a report that technical staff can use.
The uniqname of the reporter should be included with the staff notes. It is not currently in the mailed reports, which is fine for public consumption but nearly useless for staff.

Currently, the form has boxes roughly describable as:
report for users, report for managers, report for colleagues.
The report for colleagues part is crippled.

2. Routine maintenance should be reportable in one step!! (i.e. "This change was just completed".) Currently routine maintenance is not even a category. Outages that have already happened should also be reportable in one pass. Forcing people to click twice isn't going to stop us fixing first and reporting later. Directory crashes are a shining example of this. We restart the daemon, then report that it crashed and was restored. It is not possible to determine whether there is a more serious problem without first investigating. In simple cases this is concurrent with a fix. If there is an outage that the on-call staff knows the solution to, the first step is usually resolution, then reporting.

3. Service checkboxes are needed for: Email gateway or Mail Routing, UMOD/Directory!!!!!!!, Printing, DHCP, SpamBox, and possibly email virus blocking.
IMAP Email is a very general category; mail delivery to IMAP is different from IMAP access. Webmail should be listed. Other groups likely have other additions.

4. Improved reporting and REPORT NAMING guidelines.
Currently, action in case of outages is clear.
Reporting criteria for scheduled maintenance w/o outage, routine maintenance and service degradations are less clear.

Cases: nefu restarts (currently listed as scheduled maintenance--no outage)
removing one machine from a redundant pool (currently listed as service degradation, but there is no actual service degradation at current load levels)
simta partial rollouts
slapd crashes (a simple restart, reported after a crash)

Production machines are changed regularly. Some changes are made by human intervention (clearing tripwires) and some are automated (imap updates). Currently we are mandated to report human-initiated changes, but in some cases this is silly, particularly clearing tripwires and nefu restarts.

4 b. Report naming is currently being refined by 'freak out' reactions from managers. Please give us sample headings for reports that we should model. Ideally these would be automatically generated by the tool based on the features selected.
Naming guidelines for services. What are the user-consumable service names we should be using? Should email delivery be called mail gateway, mail routing, email gateway, mail relay, mail forwarding, mail delivery, etc.? How should an IMAP outage (aka "email") be distinguished from a webmail outage? Can we have a list of the official names of all our services so we know what they are called?
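
As a rough illustration of headings generated from the features selected, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the lookup tables, the function name and the heading format are hypothetical, and the service names are placeholders until an official list exists.

```python
# Hypothetical sketch; none of these names come from the actual status tool.

# Placeholder service names. The official, user-consumable names would
# replace these once someone publishes the list.
OFFICIAL_SERVICE_NAMES = {
    "imap": "IMAP Email",
    "webmail": "Webmail",
    "directory": "UMOD/Directory",
}

# Placeholder category labels, including the requested routine category.
CATEGORY_LABELS = {
    "outage": "Outage",
    "degradation": "Service degradation",
    "scheduled": "Scheduled maintenance, no outage expected",
    "routine": "Routine maintenance",
}


def report_heading(category, services):
    """Build a report heading from the selected category and services."""
    names = [OFFICIAL_SERVICE_NAMES[s] for s in services]
    return f"{CATEGORY_LABELS[category]}: {', '.join(names)}"
```

With a scheme like this, two people reporting the same kind of event would get the same heading, and an unknown service key would fail loudly instead of producing a made-up name.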

5. Categories do not reflect all types of work, such as 'routine maintenance'. We have been listing these as 'scheduled maintenance--no outage expected', but without allowing lead time for notification, which leads to the aforementioned 'freak outs'.
Should routine maintenance be reported at all? If so, a routine maintenance category should be created.

6. Field for server name(s). The servers involved should be separately tracked so we can later search by machine. (optional field) Would be useful for staff to later track problems. E.g. look up on frontend imap - could yield all the frontend imap maintenance, or specific names would probably only yield the emergency outages/maintenance.

7. Automated reminders about open reports whose dates have passed. Human checking by the outages group seems less useful.
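
The reminder check in item 7 could be as simple as the following Python sketch. It assumes each report records the reporter's uniqname, an expected end time and an open/closed flag; the `Report` class and both function names are hypothetical, since we do not know how the tool stores its data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Report:
    """Hypothetical record layout; the real tool's fields are unknown."""
    reporter: str           # uniqname of the person who filed the report
    title: str
    expected_end: datetime  # when the work or outage was expected to end
    closed: bool = False


def overdue_reports(reports, now):
    """Return open reports whose expected end time is already past."""
    return [r for r in reports if not r.closed and r.expected_end < now]


def reminder_text(report):
    """Format a one-line reminder addressed to the original reporter."""
    return (
        f"To {report.reporter}: report '{report.title}' is still open "
        f"past its expected end of {report.expected_end:%Y-%m-%d %H:%M}."
    )
```

Run on a schedule, a check like this would mail each reporter directly, so nobody in the outages group has to comb through open reports by hand.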

We think we might have more later, but these are our immediate and pressing concerns.

Contact us if you have any questions at blackops@xxxxxxxxxx

On Oct 10, 2005, at 2:54 PM, Willie Northway wrote:

Please send me your thoughts on the current state of the ITCS status reporting tool. Tell me what you like as well as what you think should be changed, both technically and policy-wise. Only with this list in hand will we be able to present an effective case for change. I'll bring this to a meeting, and I'd like to invite anyone who feels strongly to accompany me, so please identify yourself if you're interested.

Thanks - Willie

Willie Northway                  University of Michigan Webmaster Team
http://willienorthway.com/       http://www.umich.edu/~umweb/