Information for Server/network failure                    updated 26-11-07

If I can't be contacted and the dept. network or servers fail, please give this information to anyone authorised to diagnose and fix the problem. There are hard copies of this in the fireproof backup box and on the wall in Wolfson House G35.

Thanks
Warwick

----------------------------------------------------------------------------
Contacts and Info for problems with Dept Phon & Ling servers/network 18/07/07
-----------------------------------------------------------------------------

Contact:
Warwick Smith       020 7679 7430   w.smith@ucl.ac.uk

OR if not available

Contact:
Dave Cushing        020 7679 7400   d.cushing@ucl.ac.uk
Steve Nevard        020 7679 3156   s.nevard@ucl.ac.uk

OR if not available

Contact:
Mark Huckvale       020 7679 5002   m.huckvale@ucl.ac.uk
Hans Van de Koot    020 7679 3165   h.v.d.koot@ucl.ac.uk

If none of the above are available:-

For Dell server and attached peripherals (bell1, bell2, wave) hardware problems:-

Contact:
Dell Local Government and Education Customer care   phone: 01344 373 199
or
Dell technical Support                              phone: 08709 080 500

quoting the serial number of bell1 or bell2 listed below and printed on the computer.

Under a Dell Maintenance contract, hardware support is available via phone or email, plus 4 hour on site replacement of faulty hardware. The person coming to replace faulty hardware may not be able to help with related software problems.

Software Problems:-

bell1
-----
Phone support is available from LinuxIT 0117 905 8718 (www.linuxit.com)
Web/email support from Redhat https://www.redhat.com/apps/support/
For web username and password see attached sheet on printout of this document in backup box. This gives web support, knowledge base, and email support.

bell2
-----
No software support apart from academic subscription to get updates from Redhat. Phone support can be obtained from LinuxIT covering bell2 if the problem would also apply to bell1.

wave
----
Applications server wave is currently covered for software and hardware support from Dell. Contact phone numbers of Dell above.

Server serial numbers (Called service tags by Dell)
---------------------
bell1: 3CFWX1J
bell2: 4CFWX1J
wave:  BG4PR2J

Server model numbers
--------------------
bell1 and bell2: PowerEdge 1800
wave:            PowerEdge SC440

Operating system used (bell1 bell2 wave)
---------------------------------------
Redhat Enterprise Linux 4 64bit (RHEL 4)

Server Location
---------------
bell1 and bell2: room B18 Basement Wolfson House
wave:            currently G29 Wolfson House. Planned to be in room G33 Wolfson House.

In addition, hardware documentation for the Dell PowerEdge 1800 servers is available on www.dell.co.uk, Support and Help, Servers and Storage, Power Edge. There is also information here on operating systems relevant to this hardware (RHEL 4 Linux).

In addition, full software documentation for the Server Operating System (Redhat Enterprise Linux 4) is available at www.redhat.co.uk, support, Documentation, Redhat Enterprise Linux, Redhat Enterprise Linux Administration Guide. The knowledge base area under support can be accessed with a username and password. For these see sheets "Information for Server/network failure" in backup box.

The Dell Maintenance hardware contract covers failure of the systems when used within the recommended environment. Systems stolen or physically damaged by the environment are not covered. In these cases, replacements need to be purchased from Dell.

For all other problems:-

Contact: UCL Information systems (IS) www.ucl.ac.uk/is

General Enquiries to UCL Information systems
----------------------------------------------
Helpdesk 37779

Network Problems
----------------
Network Group:-
Operations:                            37350
Section head:   Michael Turpin         37828
Group Manager:  Andrew Kerl            37344
Pauline Swindells to locate people:    32359

E-mail
------
Non routine technical problems: Adrian Barker 25140
Routine problems: postmaster@ucl.ac.uk

Restoring Backups
-----------------
Steven Bridge    25149   operating systems group
Paul Hajisavvi           operating systems group

General advice on Unix etc
--------------------------
Contact: UCL IS operating systems group.

Security, hacking etc
---------------------
Marion Rosenberg CERT 32434 / 37388

------------------------------------------------------------------------
------------------------------------------------------------------------

THE INFORMATION GIVEN BELOW MAY BE NEEDED BY DELL OR UCL I.S. IF CONTACTED, OR ANYONE ATTEMPTING TO DIAGNOSE AND FIX PROBLEMS.

Most of the following information concerns server configurations not supplied by Dell or Redhat Linux as a standard package and therefore anyone diagnosing problems needs to be aware of it.

The server system comprising servers bell1 and bell2 in room G33 Wolfson House
------------------------------------------------------------------------------

SERVICES PROVIDED BY THIS SERVER SYSTEM
---------------------------------------
websites:
www.phon.ucl.ac.uk
www.actl.ucl.ac.uk
www.londonling.ucl.ac.uk
www.enhance.phon.ucl.ac.uk/

Networked disks and printing via samba server with samba name bell

Email for addresses @phon.ucl.ac.uk and @ling.ucl.ac.uk used for mailing lists eg staff_wh@phon.ucl.ac.uk. All users' email addresses in these domains are now forwarded to these users' @ucl.ac.uk addresses independently of this server.

ssh and sftp server

anonymous ftp server ftp.phon.ucl.ac.uk

OVERVIEW OF OPERATION
---------------------
Both servers run continuously and each is supplied mains power from its own external APC 750 uninterruptible power supply (UPS). If the mains power fails the UPS battery will power the server for 5-10 minutes before going flat. The UPS is to protect against brief power cuts.

bell1 is the master server which normally provides services such as web, email, networked disk access, ftp etc. If it or the network connection to it fails then bell2 takes over providing the services.

Connections are made to the services via the network address 128.40.52.16 whose associated names are bell, mail, mail1, mail2 (.phon.ucl.ac.uk), www.phon.ucl.ac.uk, www.enhance.phon.ucl.ac.uk and ftp.phon.ucl.ac.uk. This address and its names are assigned normally to bell1 but alternatively to bell2 if bell1 fails.

If at any time bell1 comes back on line after having gone down, then it will take back the services from bell2 which will go into standby mode.

Each server is also accessible directly via ssh as bell1 or bell2, whether it is in standby mode or providing services. No changes should be made to files on the inactive server (normally bell2) as its files are continually kept in sync with the active server (normally bell1). Changes should therefore be made to files only when connected to bell as this will guarantee you are connected to the active server. The names bell1 and bell2 should be used for connection only for diagnostic purposes.

WARNING MESSAGES SENT OUT IF SERVICES MIGRATE FROM ONE SERVER TO THE OTHER
--------------------------------------------------------------------------
If for example the normally active server bell1 failed then a message would be mailed from the standby server bell2 as it took over the services.
This message is mailed to selected people and would read:-

To: ha_warn@phonetics.ucl.ac.uk
Subject: Resource Group Takeover in progress on bell2.phon.ucl.ac.uk

Resource Group Takeover in progress on bell2.phon.ucl.ac.uk
Command line was: /etc/ha.d/resource.d/MailTo ha_warn@phon.ucl.ac.uk start

If bell1 was restored, bell2 would send out the following message, and bell1 would send a message similar to the one above as it took back services.

To: ha_warn@phonetics.ucl.ac.uk
Subject: Resource Group Migrating resource away from bell2.phon.ucl.ac.uk

Resource Group Migrating resource away from bell2.phon.ucl.ac.uk
Command line was: /etc/ha.d/resource.d/MailTo ha_warn@phon.ucl.ac.uk stop

The list of email addresses receiving these messages is in file /etc/aliases on server bell1 under mailing list ha_warn.

DIAGNOSING PROBLEMS
-------------------
To determine which server is the active server, log in to bell.phon.ucl.ac.uk using ssh and at the Unix prompt type the command: active_server

The message:-

This server, bell1.phon.ucl.ac.uk is currently active for services

should appear. If the message:-

This server, bell2.phon.ucl.ac.uk is currently active for services

appears, then bell1 or the network connection to it has failed. This should be fixed as soon as possible because bell2 does not provide nightly backups, changed files may be out of sync on one server, and no standby server is available while bell1 is down.

KEEPING THE FILES ON EACH SERVER IN SYNC
----------------------------------------
Email inboxes (now used only for some system mailboxes) and hansiebase data are very time critical so are synced using DRBD (Distributed Replicated Block Device). This is not part of RedHat Linux and there is no maintenance contract covering it. For details see www.drbd.org.
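
Because DRBD is not covered by any support contract, a quick check of the replication link may help anyone diagnosing a sync problem. The following is a minimal sketch only. It assumes the installed DRBD version reports its state in /proc/drbd with one line per device of the form " 0: cs:Connected ...", where cs: is the connection state. The function name drbd_not_connected is made up for illustration and is not installed on bell1 or bell2.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the bell1/bell2 configuration.
# Reads DRBD status text on stdin (the /proc/drbd layout is an assumption)
# and prints "drbd<minor> <state>" for every device whose connection
# state is anything other than "Connected".
drbd_not_connected() {
    awk '
        # Device lines are assumed to look like " 0: cs:Connected ..."
        /^ *[0-9]+: *cs:/ {
            minor = $1
            sub(/:$/, "", minor)      # "0:" becomes "0"
            state = $2
            sub(/^cs:/, "", state)    # "cs:Connected" becomes "Connected"
            if (state != "Connected")
                print "drbd" minor " " state
        }
    '
}

# Possible use as root on either server (no output means all links are up):
#   drbd_not_connected < /proc/drbd
```

If this prints a device name, the dedicated link between bell1 and bell2 or the DRBD service on the other server is the first thing to check. The exact state names depend on the DRBD version, so compare against a plain cat /proc/drbd before relying on the output.
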

The following are mounted on the active server

/dev/drbd0   5281400   313128   4699984   7%   /share1   (email)
/dev/drbd1   5281400    57656   4955456   2%   /share2   (hansiebase)

Unmounted disk devices /dev/drbd0 and /dev/drbd1 on the inactive server are almost instantaneously kept in sync via a high speed dedicated ethernet cable between bell1 and bell2. This can be thought of as network RAID.

All other files and directories which need to be kept in sync are updated every 10 minutes by a script using the rsync command and executed by the cron system. These are:

/web /home /btemp /backup /bdata /share3
and files in /etc: passwd shadow group gshadow aliases,
web configuration files in /etc/httpd/conf,
samba configuration files and samba password file in /etc/samba
and all of /usr/local.

Files which are excluded from syncing are listed in /usr/local/etc/ha_sync.

SYNCING FILES BETWEEN SERVERS BEFORE SHUTTING DOWN THE ACTIVE SERVER.
--------------------------------------------------------------------
This is not yet automated so needs to be done manually.

The active server should be shut down as root with the command:-

ha_mirror_all -replicate; init 0

This ensures that any files changed since the last sync are updated on the other server, which will automatically take over services.

********CAUTION! WHEN BOOTING BELL1**********
---------------------------------------------
First log in to bell2 as root
Type: date
to display exact time

Ensure that:-
The command to boot or reboot bell1 is given as close as possible to 1 minute past the hour or 1 minute past any multiple of 10 minutes past the hour, eg 9:11am, 9:21am, 9:31am, 9:41am, 9:51am etc.

This will then ensure that when bell1 has completed booting up there is time for the action described below to be taken before bell1 updates files on bell2. Otherwise file changes or new files created on bell2 when it was the active server will be deleted or overwritten with the older versions by bell1.

Then take this action:-

Make bell1 boot up.
eg:
If it is hung and nothing can be typed:-
Try pressing the reset button on the front panel. If this does nothing then switch the output of its UPS off then on so the mains power to bell1 is interrupted.
or
If there is a unix root prompt
type: init 6
for a reboot

When bell1 has booted and the login prompt appears (this can take a few minutes, with a pause of up to 40 seconds after "nash" is displayed):

Immediately log in to bell1 as root.
type: service heartbeat stop
(this makes bell2 the active server)

On bell2 at root prompt
type: ha_mirror_all -replicate
(this updates any files on bell1 changed or created on bell2 when it was the active server)

When the Unix prompt reappears on bell2, then on bell1 immediately
type: service heartbeat start
(this makes bell1 the active server)

MAKING A SERVER THE ACTIVE SERVER
---------------------------------
bell1 is normally the active server and the server system will default to bell1 as the active server if it and its network connection are operating correctly. If it fails then bell2 becomes active until bell1 comes on line again. Which server is active is controlled by the heartbeat program. This is not part of RedHat Linux so is not covered by the maintenance contract. For information see www.linux-ha.org.

The standby, normally inactive, server bell2 can be made active by stopping heartbeat running on bell1. Files should be synced first between servers.

To make bell2 active instead of bell1:-
On bell1 as root type:-
ha_mirror_all -replicate; service heartbeat stop

To make bell1 active again:-
on bell2 as root type:-
ha_mirror_all -replicate
then immediately on bell1 type:-
service heartbeat start

CURRENT ISSUES WITH SERVERS BELL1 AND BELL2
-------------------------------------------
1. Network Printing

The print server for a particular printer sometimes disables itself if the printer hangs for a long time.
To see if this has happened:-

Login to bell as root
type: lpc
at the lpc prompt type: status
scroll through the list of printers to see if any are labelled as disabled.
type: quit

For any disabled printers type:- cupsenable printer_name

This problem is often caused by a print job in the print queue which hangs the printer. If this is suspected remove this print job from the queue:-

as root on bell
type: lpq -Pprinter_name
note down job number
type: lprm -Pprinter_name job_number
switch printer off then on
type: cupsenable printer_name
if it has again become disabled

to remove all jobs from the queue
type: lprm -Pprinter_name -

2. Disk quotas

If a server reboots or becomes active after being inactive then quotas for email and hansiebase are not automatically enabled. This causes error messages if anyone at the unix prompt types quota or show_quota to see their disk quotas. These quotas are not re-enabled until the following early morning.

To fix this, at root prompt type:-

quotaon -vu /share1
quotaon -vu /share2

The output from the quota command shows disk devices, not the partition names they are mounted on. The mapping is:-

/dev/mapper/VolGroup01-LogVol01   /home     (formerly users)
/dev/mapper/VolGroup01-LogVol05   /backup   (for PCs)
/dev/mapper/VolGroup01-LogVol11   /web      (all web servers)
/dev/drbd0                        /share1   (mail inboxes)
/dev/drbd1                        /share2   (hansiebase)

The show_quota command does show the partition names.

3. Occasional Freeze up of bell1.

This may be caused by a kernel panic, a hardware problem or disk over temperature. The servers are configured to dump the kernel panic to bell2 into /var/crash if a kernel panic on bell1 occurs. bell1 has been configured to reboot itself if a kernel panic occurs.

Dell have examined the software installation and can find no fault with it. The system board and power supply have been replaced. The cause has still not been identified. If bell1 shuts itself down, Dell technical support should be contacted on 08709 080 500 and case ID number: 514641046 should be quoted.

4. Disk drive Temperature

Because of inadequate air flow in the server room, disk drive temperatures on hot days may rise to undesirably high levels. The maximum operating temperature for the five internal disks is 55 degrees C. There is a thermal cutout on the disks which trips at 65 degrees C.

Disk drive temperatures for bell1 only are recorded every 15 minutes and can be viewed at http://www.phon.ucl.ac.uk/home/dept/computing/drive_temp.txt

An email is sent to selected addresses every 20 minutes while the temperature of any disk is recorded as over 54 degrees C. All doors to the server room should be opened to increase air flow if this temperature is reached.

The addresses this is sent to are in file /etc/aliases on bell1 against alias drive_temp_warn. These can be removed or added to. When the aliases file is saved, the command newaliases must be typed as root for the saved file to become active.

BELL and WAVE SERVER BACKUPS, DISK INFORMATION AND BOOT CD
-------------------------------------------------
These are in fireproof box G35 Wolfson House.

Restoring Backups
-----------------
For bell servers:-

RESTORE THE FOLLOWING OR ANYTHING UNDER THEM TO THE ACTIVE SERVER ONLY (normally bell1). They will be automatically copied to the inactive server (normally bell2):-

/web /home /btemp /backup /bdata /share1 /share2 /share3
/etc/passwd /etc/shadow /etc/group /etc/gshadow
web_conf smb_conf /usr/local /etc/aliases /etc/httpd.conf

For bell1 bell2 and wave servers:-

All disk partitions (except /tmp) are backed up to the UCL IS backup server nightly. To restore from this backup the dsmc command is used when logged in as root (su). This can only be used if the system will boot from disk to multi user mode, there is a network connection to the backup server and the Tivoli backup software is installed in /opt/tivoli.

A backup is made every few months to local DAT tape for bell1, bell2 and wave of the system partitions /(root) /usr /var /boot.
These backups are kept in the fireproof backup box in room G35 Wolfson House. To restore from this backup the restore command needs to be used with the DAT tape holding the latest backups in the DAT drive on the system. This restore needs to be used if system software is damaged, so preventing the dsmc command from working.

In addition, copies of /(root) /usr /var for bell1 and bell2 are kept on normally unmounted partitions. See section below for details headed:-
"Restoring root and system files if won't boot from boot disk"

Use of the dsmc command to restore backups from UCL backup server
-----------------------------------------------------------------
Log in as root

To restore a file to the place on disk it was backed up from, eg the file /etc/aliases

dsmc restore /etc/aliases -latest

To restore /etc/aliases to /btemp

dsmc restore /etc/aliases /btemp/aliases -latest

To restore the whole of the directory /web/www_phon/dept including any sub directories to /web/www_phon/dept

dsmc restore /web/www_phon/dept/"*" -subdir=y -latest

An interactive method is also available to select files and directories for restore on screen using the -pick option, eg to select parts of /etc for restoring

dsmc restore /etc/"*" -subdir=y -inactive -pick

A list will appear with a menu allowing files/directories to be selected then restored. Files/directories which have been deleted from local disk are labelled I, versions on disk are labelled A.

For more help with the dsmc command, type dsmc then ? at the tsm prompt.

To restore any system files and/or directories from DAT tape:-
--------------------------------------------------------------
Only bell1 and bell2 have a DAT tape drive. To restore to wave, its DAT backup tape is put in the drive on bell2 and accessed from the restore command on wave. See following section "Restoring from DAT tape to wave"

The following assumes that the system is already booted from an attached disk.
If booted from the Installation CDROM, the disk partition where files are to be restored must first be mounted on /mnt. Then cd to /mnt/restore_directory_name (see below under 'Failure of Root Filesystem')

Place DAT 72 tape in bell DAT drive.
At SU prompt, cd to the directory on disk where the top of the tree to be restored should appear.

eg to restore data on tape containing /usr backup:

Ensure that the disk partition /dev/mapper/VolGroup00-LogVol02 is mounted on /usr by typing df.
Type: cd /usr
at Unix prompt.
Type: restore -ivf /dev/st0
Type ? at restore prompt to see restore options.
Type ls to see directories available for restore. All directories directly under /usr should be listed.
Type add