Re: ITCS status tool feedback
This is the feedback blackops worked on together. To this I would add
that the 24 hour limitation should be immediately rolled back, or a
category for non-emergencies under 24 hours, e.g. 'routine
non-user-visible environment changes', should be added.
Subject: outage/status tool feedback
Date: June 2, 2005 3:25:40 PM EDT
We worked on these as a group. Not sure who you want them sent to, or
if we should next get input from other UMCE groups.
IN ORDER OF PRIORITY
1. Staff-only notes should be mailable. Staff groups to receive
outages with notes should be separately selectable from groups to
receive the outage notice. Groups should include 'blackops' 'gpcc'
'swat.team' (web) 'wats' 'sites.tsg' etc. There are probably others.
Currently we have to send additional explanatory messages to our groups
after reporting outages. One of the features we were promised was
one-stop reporting, but currently there is no way to make a report
consumable by technical staff.
The uniqname of the reporter should be included with the staff
notes. It is not currently in the mailed reports, which is fine for
public consumption but nearly useless for staff.
Currently, the form has boxes roughly describable as:
report for users, report for managers, report for colleagues.
The report for colleagues part is crippled.
2. Routine maintenance should be reportable in one step!! (i.e. "This
change was just completed.") Currently routine maintenance is not even
a category. Outages that have already happened should also be
reportable in one pass. Forcing people to click twice isn't going to
stop us fixing first and reporting later. Directory crashes are a
shining example of this. We restart the daemon then report that it
crashed and was restored. It is not possible to determine if there is
a more serious problem without first investigating. In simple cases
this is concurrent with a fix. If there is an outage the oncall staff
knows the solution to, the first step is usually resolution - then
3. Service checkboxes are needed for: Email gateway or Mail Routing,
UMOD/Directory!!!!!!!, Printing, DHCP, SpamBox, and possibly email
IMAP Email is a very general category; mail delivery to IMAP is
different from IMAP access. Webmail should be listed. Other groups
likely have other additions.
4. Improved reporting and REPORT NAMING guidelines.
Currently, action in case of outages is clear.
Reporting criteria for scheduled maintenance w/o outage, routine
maintenance and service degradations are less clear.
Cases: nefu restarts (currently listed as scheduled maintenance--no
removing one machine from a redundant pool (currently listed as
service degradation, but there is no actual service degradation at
current load levels)
simta partial rollouts
slapd crashes (a simple restart, reported after a crash)
Production machines are changed regularly, sometimes by human
intervention (clearing tripwires) and sometimes automatically (imap
updates). Currently we are mandated to report human-initiated changes,
but in some cases this is silly, particularly clearing tripwires and
4 b. Report naming is currently being refined by 'freak out' reactions
from managers. Please give us sample headings for reports that we
should model. Ideally these would be automatically generated by the
tool based on the features selected.
Naming guidelines for services: what are the user-consumable service
names we should be using? (Should email delivery be called mail gateway,
mail routing, email gateway, mail relay, mail forwarding, or mail
delivery, etc.?) How should an IMAP outage (aka "email") be
distinguished from a webmail outage? Can we have a list of the
official names of all our services so we know what they are called?
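
As a rough illustration of the automatically generated headings suggested
above, here is a minimal sketch. The function name, the argument names, and
the example category and service strings are all hypothetical; they are not
taken from the actual status tool.

```python
def report_heading(category, services, servers=()):
    """Build a report heading from the options selected on the form.

    category, services, and servers are hypothetical form fields,
    not the status tool's real ones.
    """
    heading = f"{category}: {', '.join(services)}"
    if servers:
        # Server names are optional and appended in parentheses.
        heading += f" ({', '.join(servers)})"
    return heading


if __name__ == "__main__":
    print(report_heading("Routine maintenance", ["Directory"]))
```

Generating the heading from the selected category and service names would
keep report titles consistent without relying on each reporter's wording.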
5. Categories do not reflect all types of work, such as 'routine
maintenance'. We have been listing these as 'scheduled maintenance--no
outage expected' but without allowing lead time for notification,
leading to the aforementioned 'freak outs'.
Should routine maintenance not be reported? If so, a routine
maintenance category should be created.
6. Field for server name(s) (optional). The servers involved should be
tracked separately so staff can later search by machine and track
problems. E.g. a lookup on frontend imap could yield all the frontend
imap maintenance, whereas specific machine names would probably yield
only the emergency outages/maintenance.
7. Automated reminders about open reports whose dates have passed. Human
checking by the outages group seems less useful.
We think we might have more later, but these are our immediate and
Contact us if you have any questions at blackops@xxxxxxxxxx
On Oct 10, 2005, at 2:54 PM, Willie Northway wrote:
Please send me your thoughts on the current state of the ITCS status
reporting tool. What do you like, as well as what you think should be
changed - both technically and policy-wise. Only with this list in
hand, will we be able to present an effective case for change. I'll
bring this to a meeting, and I'd like to invite anyone who feels
strongly to accompany me, so please identify yourself if you're
Thanks - Willie
Willie Northway University of Michigan Webmaster Team