User's guide for the SNMPSTAT/WWW monitoring system for IP networks.

This document describes the monitoring system designed for IP providers and used in RELCOM and a few other Russian networks. The system is oriented towards 24x7 operator staff; the operators' regulations are described in a separate document.

1. Common terms and objects for the monitor.

1.1. The content of the system.

This system consists of the following parts:

The monitoring process works as follows:

The pre-defined information pages (located in the 'OUT' subdirectory) are generated in advance for speed; they can also be requested through the CGI script.

1.2. Monitoring objects.

The system monitors:

Every object has a type, defined by one letter (R, I, B, M; see above), and a unique name. The following is collected for every object:

Every object is polled every 10 to 30 seconds, as defined in the Poll.conf configuration file. The results are written into the 'IFSUM' file, and the system redraws the screen view every 30 seconds (or on every CGI request). The polled data are summarised over an _average_ period (usually 2 to 3 minutes) and then recorded in the accounting ('stat') file, where both the _average_ and the _maximum_ values are written. These files are used for the _graphics_ and _reports_, or can be viewed as _raw_ data on the operator's (WWW) screen.
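The summarising step can be illustrated with a minimal Python sketch. It is not the snmpstatd implementation: the record layout, the function name and the 120-second period are assumptions chosen only to show how polled samples collapse into one _average_ and one _maximum_ value per period.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class StatRecord:
    """One accounting record (hypothetical layout, not the real 'stat' format)."""
    start: float      # start of the averaging period, in seconds
    average: float    # mean of the polled values in the period
    maximum: float    # largest polled value in the period


def summarise(samples: Iterable[Tuple[float, float]],
              period: float = 120.0) -> List[StatRecord]:
    """Group (timestamp, value) samples into fixed periods.

    `period` is an assumed value inside the 2 to 3 minute range
    mentioned in the text.
    """
    buckets = {}
    for timestamp, value in samples:
        start = (timestamp // period) * period
        buckets.setdefault(start, []).append(value)
    return [
        StatRecord(start, sum(values) / len(values), max(values))
        for start, values in sorted(buckets.items())
    ]


# Link utilisation (percent) polled every 30 seconds.
polls = [(0, 10.0), (30, 20.0), (60, 60.0), (90, 30.0), (120, 5.0)]
records = summarise(polls)
```

Keeping the maximum next to the average preserves short peaks that the average alone would smooth out.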

The snmpstatd daemon (which polls the routers in the background) determines whether the object state is normal and sets the status _OK_, _WARNING_ or _ERROR_. In addition, the status _UNDEFINED_ is set if the daemon cannot collect data about an object. The WARNING state is equivalent to _OVERLOADED_. The WWW system converts these states (O, W, E, U) by adding a _priority_ digit that depends on time (the E status is first converted to the E0 state, then to E1 or E4 depending on the object priority, and so on); this keeps the operator from watching spurious events such as short failures. The state defines the color used to draw the object on the screen and (in some cases) the sound clip the system plays for important events.

In the WWW views, the status is shown by color, and some other parameters are shown by numbers in the table and (for a channel) by colored bars. An operator can choose the screen view: the total view, alarms only, or the full view for a single router. In addition, there is a _status_ view showing the total number of objects in each state; importantly, it is this view that is responsible for the music alarms.

1.3. States.

An object state is generated by the monitor and can be changed by _operator tickets_. An _operator ticket_ is a record in the journal that defines a NEW state derived from the OLD one, with a comment, an expiration time and (optionally) a condition under which the ticket is removed (the flag _remove this ticket when the normal state is restored_). Only a few states are generated by the monitoring; more states can be set by the operator. Moreover, there are 2 types of tickets: the _permanent comment_ and the _comment to the current state_. The first is used by the senior operator or by the sysadmin to change the object status (and priority) permanently, and should be replaced by the priority in future revisions.

The table below describes all states, their origins and the corresponding colors and sounds for the default configuration (this is configurable and can be changed during installation):

Table 1. Standard states.

Name  Origin               State                         Color    Weight  Sound (BGP)  Sound (Channel)  Sound (Router)
E0    Monitor              Just failed                   MAROON   220
E1    Monitor              Failure                       RED      270                                   sound, music
E2    Operator             Failure - fixing in progress  AQUA     250
E3    Operator             Failure - can't be fixed      PURPLE   210
E4    Monitor or operator  Important failure             FUCHSIA  280                  sound, music     sound, music
O0    Monitor              Just restored                 LIME     10
O1    Monitor              Normal                        GREEN    5
O2    Monitor              Normal                        GREEN    5
U0    Monitor or operator  No data                       BLUE     200
U1    Operator             Not considered                GRAY     200
U2    Operator             In debug                      NAVY     200
U3    Operator             Out of our competence         BLACK    200
W0    Monitor              Overload appeared             OLIVE    120
W1    Monitor              Overload                      YELLOW   180
W2    Operator             Overload - can't be fixed     TEAL     150

In the table above, the origin describes where a state can come from. The monitoring system itself can create the states O0 and O1 (just restored, and normal); E0 and E1 (errors: E0 means _the error appeared recently_ and E1 means _the error has lasted more than 2 minutes_); E4 (as E1, but for IMPORTANT objects; this revision determines whether an object is IMPORTANT from its name: all objects named in CAPITAL letters are important, which will change in future releases); W0 and W1 (warnings: just appeared, or lasting more than 2 minutes); and U0 (the object, or monitor data for it, cannot be found).

The E4 state makes it possible to single out important errors that affect the whole network rather than one object only. In this release it can be set via the _permanent comment_ by the sysadmin; otherwise the system treats any E1 on an object named in CAPITAL letters as E4.
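The conversion of a raw monitor status into a displayed state can be sketched in Python as follows. This is an illustration under stated assumptions, not the monitor's code: the function names are hypothetical, the 2-minute delay is applied to O, W and E alike, and "named in CAPITAL letters" is read as "every letter in the name is upper case".

```python
def is_important(name: str) -> bool:
    """Treat a name as IMPORTANT if all of its letters are capitals.

    Assumption: digits and punctuation in the name are ignored.
    """
    letters = [c for c in name if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def displayed_state(status: str, seconds_in_state: float, name: str,
                    settle: float = 120.0) -> str:
    """Add the priority digit to a raw status letter (O, W, E or U).

    `settle` is the assumed 2-minute delay after which a fresh state
    (E0, W0, O0) is treated as established (E1 or E4, W1, O1).
    """
    if status == "U":
        return "U0"                   # no data for the object
    if seconds_in_state < settle:
        return status + "0"           # just appeared or just restored
    if status == "E" and is_important(name):
        return "E4"                   # important failure
    return status + "1"
```

For example, `displayed_state("E", 30, "rich-512")` gives `"E0"`, while the same error lasting 300 seconds gives `"E1"` for `rich-512` and `"E4"` for `M9-11-KIAE`.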

All other states can be set by the operators; their purpose is to describe the real object state, as investigated by the operator, more precisely.

The rules the operators follow when setting states should be defined in the _OPERATION GUIDE_ and depend on the company. The common rule is to set some state other than E1 / E4 for every failure that does not affect the network as a whole, so that operators notice new events as they appear. If the operators follow this policy, the STATE window always shows all new and uninvestigated errors (failures), and such events are always colored RED in the ALARM window. In future releases we will reduce the number of operator-defined states to just 2 (Failure is fixed, and Failure can't be fixed for now), with an additional _PRIORITY_ that allows any object to be marked as, for example, a NOTHING object (priority 0) or a VERY IMPORTANT object (priority 5).

A very important feature of the monitor is playing music clips on certain errors; in this revision these are any ERROR on a router and errors on the important INTERFACES. The states that trigger clips are listed in the table above; the clip names and the triggering statuses can be changed in the configuration. There are 2 ways to play clips: the MIDI plugin (recommended) and the _WAV_ plugin (not recommended); the first choice is called _MUSIC_ and the second _SOUND_ everywhere in the tables and select menus. You should install a MIDI plugin to use this feature; the monitor tries to determine whether your browser supports MIDI or WAV files, but this depends on JavaScript features and can't be guaranteed.

Any state caused by the MONITORING is accompanied by a REASON if it is not the NORMAL state; the REASON and the TIME OF EXISTENCE are shown on the various views.

1.4. The data collected.

The system collects the following data about the monitored objects:

For the router:

  1. The router status:
  2. Uptime of the router
  3. The processor load of the router, %.
  4. Free memory (total free memory; it may be the sum of several types, depending on the router type).
  5. The temperature inside (if it is measured).
  6. The busy and free MAIN memory;
  7. The busy and free IO memory;

For the channel (interface):

  1. The state of interface:
  2. The status of the input interface line:
  3. The status of the output interface line:
  4. Transmit errors and receive drops often indicate a lack of some resource. Transmit drops usually indicate an overloaded link, or result from traffic shaping or rate limiting (in the case of CAR or ATM PVC).

For the BGP connection:

  1. Time of BGP status
  2. The BGP status

This monitor revision does not use information about BGP connections, except on the _FULL_ screen, which shows the full information derived from the router.

Let us now look at an example: the screen describing the full router status (including the channels (interfaces) and BGP connections).

Table 2. FULL router screen.

Wed Dec 3 23:48:50 1997 1310[TOTAL]
M9-8 cpu: 56%(17S) U 9d3h 56% 8.9M
Se0  turbo     U  17.7%  112.8 p/s  0.0% err  29.4%  96.5 p/s   0.0% drps
Se1  sakhml    U   7.0%   26.1 p/s  0.0% err   2.1%   2.4 p/s   0.0% drps
Se2  rich(1)   U  18.8%    9.6 p/s  0.0% err  72.5%   7.0 p/s  45.1% drps
Se3  rich(2)   U  18.4%    9.7 p/s  0.0% err  74.3%   7.3 p/s  43.0% drps
Se4  rich(3)   U  19.0%    9.7 p/s  0.0% err  66.7%   6.6 p/s  47.8% drps
Se5  rich(4)   DOWN(2d3h)
Se6  pgts      U   0.3%    0.7 p/s  0.0% err   2.6%   0.8 p/s   0.0% drps
Se7  rpac128   U   2.9%    0.0 p/s  0.0% err  17.6%   0.0 p/s   0.0% drps
Se8  gts_1     U   0.6%    0.4 p/s  0.0% err   0.2%   0.4 p/s   0.0% drps

The first line of this table shows the state of the router itself:

The following lines show the channels (interfaces) and the BGP sessions. For example, consider the line describing rich(1):

2. Calling the monitor and customising it.

First, open the start window of the monitor. Usually it is the URL 'http://your_server:8100/M' for the operator's interface and 'http://your_server:8100/U' for the link owner.

The system then lets you choose and open one of several views. To make this choice, you should understand which windows exist and what they mean.

The system uses 5 different windows:

The system offers a few pre-defined window layouts; the first thing you see is the starting screen, which asks you to choose one of them. This screen looks like this:


MONITOR: [KOI-8] [WIN] [HOME] [INFO] [ADMIN] [PUBLIC] [LINKS] [LINKS INT]
The monitoring for IP network [РДС], Rev. 1.2.
The view of monitor Call
Work with the reports CATALOG
Small window, menu in new windows
Screen with the menu and main monitor
All in one window
Where to show: Audio signals:
Guides: [Guide] [Reglament]

MONITOR: [KOI-8] [WIN] [HOME] [INFO] [ADMIN] [PUBLIC] [LINKS] [LINKS INT]

To open the monitor screen (remember that we are talking about the screens only; the monitor daemon 'snmpstatd' must always run in the background), you should:

  1. Select the sound mode with the first button ('music' if you have a MIDI plugin, 'sound' if you have a WAV/AIFF plugin, 'none' if you have neither). If you use Netscape, a JavaScript program attempts to set an appropriate default for this button.
  2. Select the window for the monitor. By default it is a new window, but you can select the same window with the button at the bottom of this screen. Don't use it if you choose the _SMALL_ menu form.
  3. Call the monitor with one of the buttons, which select the form of the main window: small (SUMM window only, useful if you want to see only the important errors); AllInOne (everything in one big window, useful if your system is not too big); and the third button, which opens the MENU, SUMM and MONITOR screens in one big window but opens the ROUTER and LINK windows separately.
  4. These screens are opened with JavaScript. You can create your own frame manually or even call any screen directly through the CGI script http://your_server:8100/M/C/ASK.cgi?op=operation&SND=midi, where
    operation          screen
    PAGE.frame_all     All in one screen
    PAGE.frame_small   Small menu
    TOTAL              Total network view
    ALARM              Alarms only
    We do not recommend using these operations directly, except PAGE.frame_small (which can be embedded in more complex frames).
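For readers who build their own frames, the direct call can be composed as in the Python sketch below. Only the URL pattern, the port 8100, the four operation names and the value SND=midi come from this guide; the helper name `monitor_url` is hypothetical, and no other SND values are assumed to exist.

```python
from urllib.parse import urlencode

# Operation names listed in the table above.
OPERATIONS = {"PAGE.frame_all", "PAGE.frame_small", "TOTAL", "ALARM"}


def monitor_url(server: str, operation: str,
                sound: str = "midi", port: int = 8100) -> str:
    """Build the ASK.cgi URL for one monitor screen (hypothetical helper)."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    query = urlencode({"op": operation, "SND": sound})
    return f"http://{server}:{port}/M/C/ASK.cgi?{query}"
```

With these assumptions, `monitor_url("your_server", "PAGE.frame_small")` returns `http://your_server:8100/M/C/ASK.cgi?op=PAGE.frame_small&SND=midi`, the form recommended for embedding.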

3. The usage of the monitor. Main screens.

3.1. A few notes about the references and the buttons.

This monitor uses standard HTTP technology. Almost all object names, interface names and menu bars on the screens are HTML links that open new screens (in the same or another window) when you click on them. As usual, you can open any link in a new window with the middle mouse button (on a 3-button mouse) or by pressing the right button and selecting from the menu.

Remember that if the link you choose normally opens in a separate named window (such as the LINK window), and this window (1) already exists and (2) is minimized, then, depending on your OS, you may not see the new document at once; you have to find and restore the minimized window first.

3.2. Usage of the monitor.

When you work with the monitor, almost all useful information can be shown in the MONITOR screen. This can be the TOTAL or ALARM network view, as well as the network snapshot or the system journal.

There are 3 types of screens (frames) in the monitor. The first type is the screens that are refreshed periodically: the SHORT, TOTAL and ALARMS screens called by the main menu buttons. These screens are refreshed every 30 or 60 seconds and are prepared in advance by mon_daemon (which updates the HTML files every 30 seconds). They are the main network views, but they show the network with a delay of some 30 to 40 seconds, because for performance reasons they are not calculated _on the fly_. The 'snapshot' view, however, is built on the fly at the moment it is called.

The second type is the router pictures shown in the ROUTER window (or frame). These views are calculated on the fly and refresh every 30 seconds (unless you use the T=time parameter). Do not run too many such views at a time, or you may overload the HTTP server.

The third type is static views and menus; they are calculated on the fly but do not refresh automatically.

Every refreshed screen carries the time and the status number for which it was calculated; they are shown at the top of the TOTAL, ROUTER, ALARM and similar screens. In case of trouble (for example, snmpstatd is dead or mon_daemon freezes) you will see a noticeable difference between the current time and the time of this status.
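The comparison between the current time and the status time can also be automated. The Python sketch below parses a top line of the form `Thu Dec 4 23:45:40 1997 1452[ALARMS]`, as seen in the screen samples of this guide, and reports whether the status is older than a limit. The function names and the 90-second limit are assumptions for illustration, not part of the monitor.

```python
import re
from datetime import datetime, timedelta

# Top line of a refreshed screen: time, status number, view name.
HEADER = re.compile(
    r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) (\d+)\[(\w+)\]$"
)


def parse_header(line: str):
    """Return (status time, status number, view name) from the top line."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a status header: {line!r}")
    stamp_text = " ".join(match.group(1).split())  # normalise spacing
    stamp = datetime.strptime(stamp_text, "%a %b %d %H:%M:%S %Y")
    return stamp, int(match.group(2)), match.group(3)


def is_stale(line: str, now: datetime,
             limit: timedelta = timedelta(seconds=90)) -> bool:
    """True if the status is older than `limit` (90 s is an assumed value)."""
    stamp, _number, _view = parse_header(line)
    return now - stamp > limit
```

A limit somewhat above the normal 30 to 40 second delay avoids false alarms while still catching a dead snmpstatd or a frozen mon_daemon within a few refresh cycles.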

So, if you press the total or alarms button, you will see a screen like this (below is a very simple example):

3.3. An example of the main screen.

Table 3. ALARMS view.

Thu Dec 4 23:45:40 1997 1452[ALARMS]
RTK-M9-2 cpu: 60%(1h51m)
DAWN-1 35d2h
  As1 infoc o/drops: 5.6%(4m0s)
  Se0 fapsi-h DOWN(1h53m)
DAWN-2 10d2h
  As12 mupitt o/drops: 22.0%(5h58m)
KIAE-5 10d1h
  Se1 iter-nikimt LOOP(6h3m)
KIAE-8 91d13h
  As1 niias DOWN(3h41m)
M9-1 8d21h
  sl2 bashin o/err: 6.6%(4m20s)
M9-12 cpu: 59%(11m52s)
  As15 comtat(3) o/drops: 5.1%(4m20s)
  As3 svyaz o/drops: 9.7%(40m23s)
  As4 gts o/drops: 8.8%(28m21s)
  As6 comtat(2) o/drops: 13.6%(1h52m)
  As7 businf(1) o/drops: 11.3%(22m23s)
  As8 businf(2) o/drops: 14.7%(16m22s)
M9-3 24d2h
  sl15 rospac DOWN(12h17m)
  sl5 tixm DOWN(10h32m)
M9-4 cpu: 66%(13h34m)
  As1 ktts o/drops: 43.9%(22h46m)
  As3 izhmar(1) o/drops: 21.3%(1h34m)
  As7 izhmar(2) o/drops: 23.0%(1h34m)
M9-5 10d4h
  Se0:1 ibank1 o/drops: 12.5%(2h16m)
M9-6 cpu: 77%(8h36m)
  As1 kosnet(1) DOWN(5S)
  As13 cclearn(2) o/drops: 27.1%(4m34s)
  As2 relinfo(3) o/drops: 7.8%(1h28m)
  As3 relinfo(2) o/drops: 11.2%(40m34s)
  As4 kosgts o/drops: 7.9%(6h46m)
  As6 cclearn(1) o/drops: 25.3%(4m20s)
  As7 innet(1) o/drops: 12.2%(16h10m)
  As8 innet(2) o/drops: 9.7%(16h10m)
  As9 relinfo(1) o/drops: 12.0%(40m33s)
  Se1 kazna DOWN(22h52m)
M9-7 10d2h
  As5 infors(1) DOWN(10d2h)
  Se0 aha-M9 DOWN(3h31m)
M9-8 cpu: 67%(18h36m)
  Se2 rich(1) o/drops: 40.7%(22h46m)
  Se3 rich(2) o/drops: 34.9%(22h46m)
  Se4 rich(3) o/drops: 40.9%(11h40m)
  Se5 rich(4) DOWN(3d3h)
M9-9 16d8h
  Se0/1:21 processor i/err: 35.6%(28m40s)
  Se0/1:23 rich-512 DOWN(6h31m)
spb-relarn-1 1d10h
  As5 iephb o/drops: 8.0%(4m57s)

The first line contains the monitoring time for which this status was calculated. Note that this is not the time when the screen was built, but the time at which the data were current.

Below it there are one or several columns with the object descriptions. The format of these descriptions was defined above, with some abbreviations:

The operator can open a detailed description (and an additional menu) for every object:

  1. To open the detailed view of a router, click on the router's name; an example of this view was shown in Table 2 (in revision 1.2 this screen was merged with the router menu screen described below).
  2. To open the journal, statistics, graphics and reports for the router (the router menu), click on the status of the router (in revision 1.2 this screen was merged with the previous one).
  3. To open the CHANNEL menu (with the journal, report, graphics and other menus), click on the link name in the TOTAL, ALARM or ROUTER view. This view opens in the LINKS window.
  4. To see the current interface status (like 'show interface' on a CISCO router) for any channel, click on the interface name (not the channel name; it is the leftmost column) in these views (TOTAL, ALARM or ROUTER). This issues an 'rsh ROUTER sh interface' request, and (if you allow it on your CISCO) the 'show interface' output is shown in the LINK window.

3.4. Full view of the router.

An example of this output was shown in Table 2 above. In the new (1.2) revision this view differs slightly: it has an extra menu bar at the top.

3.5. Journal and statistics for the ROUTER or the CHANNEL.

For any object (ROUTER, CHANNEL) you can open a menu bar with buttons that show graphs, reports, journal records, the accounting archive, and so on. To call this menu for a channel, click on the channel name in any window. To call it for a router, click on the router's name or choose the router in the SUMM window.

These menus look similar; below is an example (for a channel):

Channel: svyaz 
zoom journal graphs report card archiv

The top line of this menu contains the object type (channel) and the object name (svyaz), followed by the date of the requested accounting. The + and - buttons (not shown in the sample) around this field move the date forward and backward; the date can also be typed into the field directly. You can request monthly instead of daily accounting by entering the year and month in this field in the form YYYY.MM.
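The two forms of the date field can be told apart as in the following Python sketch. The monthly form YYYY.MM is given in this guide; the daily form YYYY.MM.DD is an assumption based on dates such as 1997.12.05 in the screen samples, and the function name is hypothetical.

```python
from datetime import datetime


def parse_period(field: str):
    """Classify the accounting date field.

    Returns ("month", year, month) for YYYY.MM and
    ("day", year, month, day) for the assumed daily form YYYY.MM.DD.
    """
    text = field.strip()
    parts = text.split(".")
    if len(parts) == 2:
        parsed = datetime.strptime(text, "%Y.%m")
        return ("month", parsed.year, parsed.month)
    if len(parts) == 3:
        parsed = datetime.strptime(text, "%Y.%m.%d")
        return ("day", parsed.year, parsed.month, parsed.day)
    raise ValueError(f"unrecognised date field: {field!r}")
```

Here `parse_period("1997.12")` selects monthly accounting and `parse_period("1997.12.05")` selects the accounting for a single day.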

The second row contains the menu bar, with these buttons:

  1. graphs - shows the graphs for this object (for a channel: utilisation /input and output/, packets per second /input and output/ and average packet size).
  2. zoom - the same as graphs, but larger;
  3. report - shows a text (table and text) report together with the journal records concerning this object; the report includes per-hour link utilisation (average and max), link down time, errors and drops.
  4. card - the data base record describing this object. This is the LINKS data base; the distribution kit contains only a very simple (file-based) version of it.
  5. archiv - the accounting archive; here you can find the data for any day without changing the 'date' field (above) manually.
  6. journal - the system journal, with the ticket system. See the separate description below.

For the router object, additional buttons appear:

The router graphs need some extra explanation:

Router: M9-12 
zoom journal graphs report card archiv enter config

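M9-12 (1997.12.05)
[Graph: Load CPU. Axes: Hours (ticks at 0, 6, 12, 18) and a scale from 0 to 100. Legend: Load %, (overload) %, Failures.]
[Graph: Busy memory (from 16000). Axes: Hours (ticks at 0, 6, 12, 18) and a scale from 0 to 16000. Legend: Busy memory K, (last MB) K, Failures.]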

The first graph shows the CPU utilisation (%); extra high (> 70%) utilisation is shown in yellow.

The second graph shows the memory usage. This graph is effectively upside-down: it really shows the free memory, not the busy memory, because only the free memory is of interest.

If your router allows it, 2 extra graphs are shown: the processor memory and the IO memory (absent in our sample).

Blue marks indicate router (and channel) failures.

3.6. Journal and operator's comments (tickets).

The system journal is of great importance in this monitoring system. It is a set of journals where all messages and notes written by the operators are stored (one journal for every object, and one daily journal for every day). In addition, the comment (ticket) system is combined with the journal system; it stores the tickets used to change the object status temporarily or permanently. To call the journal, click on the [journal] button:


Router: M9-12 
zoom journal graphs report card archiv enter config
Object Name Status Reason Duration
Router: M9-12 Overload cpu:; 1h28m

Comment to the current status()

Set up:
To the time: dayhourmin.
Remove when state return to normal?
Comment:

1997.10.31 20:49:47 unknown deleted comment for R.MSK-M9-12
1997.10.31 20:49:31 unknown created comment for R.MSK-M9-12:
(W1->U2)


This is one of the operator's main tools. It allows:

The first table describes the current object state after any Permanent comment has been applied to it. For example, if a Permanent comment exists that replaces the state Failure with the state Important failure, it is the latter state (Important failure) that is shown here.

Then the list of existing Permanent comments is shown (absent in the example above).

Then follows the list of existing Comments to the event and (if no comment corresponds to the current state) an empty form for creating such a comment.

The last part consists of the object journal (its form depends on the monitor revision and can differ slightly from what is documented here).

Let us look at the Comment to the event in detail (starting from the 'comment to the event' title):

The first rows (on the white background) contain the ticket header and (usually) should not be changed. The fields here define the object type, object name and object status to which the ticket applies, as well as the ticket type. If you want to create a ticket for the current object and the current object state, don't change these fields. However, the system admin must change the ticket type to 'Permanent comment' to create a Permanent comment. Of course, you can create a ticket for any (not current) state and even any (not current) object, though this is not recommended. All these fields on the white background describe the 'starting state' to which the ticket applies.

The next part of the ticket describes which state should be set instead of the starting one. The rules for setting object states in place of the initial ones should be defined in the OPERATION GUIDE and depend on the company profile and other non-technical issues. The most common idea is to have operators comment on every RED event (E1, E4 and similar failures) after the initial investigation, so that RED-colored alarms are removed from the alarm screen. You can change the state descriptions by editing the configuration file, and use your own states and state names.

The next row determines the expiration rules for the ticket. First, you can limit the lifetime of the ticket to a number of days, hours or minutes, and we strongly advise doing so whenever you do not want the ticket to stay forever. When the time set in these fields has passed, the ticket is removed automatically.

The next button determines whether the ticket must be removed when the object returns to its normal state. Answer 'yes' whenever you do not expect numerous successive failures of the object.

If you create a ticket without an expiration time and without 'yes' on the 'Remove when state return to normal' button, the ticket is stored in the system forever, until you remove it manually. This mode is not recommended for frequent use.

Next is the 'Comment' field, for the operator's comments and other information. We recommend ALWAYS filling in this field.

In revision 1.2 there is an additional row of buttons defining whether the ticket should be sent to the NOC staff, the LINKS staff and/or the link owner (if his e-mail address is available from the LINKS data base). It is not shown in our sample.

The last row of buttons defines the operation on the ticket: create, remove or change it. The button 'Journal record only' lets you make a journal record without creating a ticket in the ticket base and (consequently) without changing the object state.

To edit or remove a 'Permanent comment', select this type and the starting state in the ticket form, and then fill in the other ticket fields.
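The ticket rules described in this section (starting state, replacement state, expiration time, removal on return to normal) can be summarised in a short Python sketch. It is only a model of the described behaviour: the `Ticket` fields, the function name and the treatment of O0, O1 and O2 as "normal" are assumptions, and the real ticket base is not shown in this guide.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

# Assumption: these monitor states count as "returned to normal".
NORMAL_STATES = {"O0", "O1", "O2"}


@dataclass
class Ticket:
    """Hypothetical model of one comment to the current state."""
    old_state: str                          # starting state, e.g. "W1"
    new_state: str                          # state shown instead, e.g. "U2"
    comment: str
    expires_at: Optional[datetime] = None   # None means no time limit
    remove_on_normal: bool = False


def effective_state(monitor_state: str, ticket: Optional[Ticket],
                    now: datetime) -> Tuple[str, Optional[Ticket]]:
    """Return the state to display and the ticket to keep (or None)."""
    if ticket is None:
        return monitor_state, None
    if ticket.expires_at is not None and now >= ticket.expires_at:
        return monitor_state, None          # expired: removed automatically
    if ticket.remove_on_normal and monitor_state in NORMAL_STATES:
        return monitor_state, None          # object is back to normal
    if monitor_state == ticket.old_state:
        return ticket.new_state, ticket     # ticket applies
    return monitor_state, ticket            # kept, but not applicable now
```

A ticket created without `expires_at` and with `remove_on_normal=False` is never dropped by this function, which mirrors the "stored forever" mode that the guide advises against.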

3.7. The system journal.

There are many journals in the monitoring system. Every monitored object (except BGP in version 1.2) has its own journal; in addition, there is a common system journal split on a per-day basis into small daily files. When you write something about an object (create or edit a ticket, or create a journal record), the system adds records to both the _object_ and the _system_ journal. This makes it possible to get all journal records for any object over its whole lifetime, or to read all journal records for any operational day.

To open the system journal, click on the [journal] button in the main menu. Below is an example of such a journal:
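The double bookkeeping can be pictured with a small in-memory Python model. The class, its method and the day key format are hypothetical; the real system keeps its journals in files whose layout is not described in this guide.

```python
from collections import defaultdict
from datetime import datetime


class Journals:
    """In-memory model: one journal per object plus one per day."""

    def __init__(self):
        self.by_object = defaultdict(list)   # object name -> records
        self.by_day = defaultdict(list)      # "YYYY.MM.DD" -> records

    def write(self, obj: str, who: str, text: str, when: datetime):
        """Store one record in both the object and the daily journal."""
        record = (when, obj, who, text)
        self.by_object[obj].append(record)
        self.by_day[when.strftime("%Y.%m.%d")].append(record)
        return record


journals = Journals()
journals.write("R.MSK-M9-12", "eap", "CPU overload, investigating",
               datetime(1997, 12, 3, 5, 26, 42))
journals.write("R.MSK-M9-12", "alex", "Load back to normal",
               datetime(1997, 12, 4, 9, 0, 0))
```

After these two calls, `journals.by_object["R.MSK-M9-12"]` holds the whole history of the object, while `journals.by_day["1997.12.03"]` holds only the records of that operational day.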

Journal for  
view write raw format

Type Object Record Was Became Who Where
Channel: nipltd Channel SviazProekt down from 12:35, investigating by Maksimuchev U. The contact persom Zverinskii Semen, 943-1293 Failure kiaed 1997.12.03 16:19:23
Channel: rich-512 Wrong FCD-2 modem, will be changed tomorrow (on the customer's end). Overload In debug alex 1997.12.03 16:00:40
Channel: M9-11-KIAE Operators are searching for the seniour engeneer. Important failure eap 1997.12.03 05:26:42
Channel: ttsv(3) Was resetted few times, no any results. Failure eap 1997.12.03 05:22:15
Channel: ttsv(2) Modem reset by the customer's request. Failure eap 1997.12.03 05:15:46
Channel: acc GL informed, problem fixed. Failure eap 1997.12.03 00:25:45

These records duplicate the object journal records. In addition, you can add an independent record here, without opening any object, with the write button. The object name in the journal is also clickable and looks up the object in the monitoring.

3.8. LINKS data base.

The LINKS data base is slightly out of scope for this guide, because it is meant to be a large information data base about all customers, providers, points of presence and devices of a particular company. The interface provided by the distribution pack lets you create, search and edit records describing channels and routers (monitored objects), as well as providers, regulations and so on. To call this data base, use the Search button of the main menu or the card button of the object menu. Below is an example of the Search screen:

Searching system LINKS

Field name (optional)

Select DB and type in field value

4. The conclusion.

This guide has described revision 1.2 of the monitoring system. Many rules have to be defined by the OPERATION GUIDE, which depends on your marketing rules and company profile. However, some recommendations are common to any company:

5. Additional guides: