Information for Server/network failure
updated 26-11-07

If I can't be contacted and the dept. network or servers fail, please give
this information to anyone authorised to diagnose and fix the problem.
There are hard copies of this in the fireproof backup box and on the wall in
Wolfson House G35.

Thanks
Warwick

----------------------------------------------------------------------------
Contacts and Info for problems with Dept Phon & Ling servers/network 18/07/07
----------------------------------------------------------------------------

Contact:
Warwick Smith      020 7679 7430   w.smith@ucl.ac.uk

OR if not available
Contact:
Dave Cushing       020 7679 7400   d.cushing@ucl.ac.uk
Steve Nevard       020 7679 3156   s.nevard@ucl.ac.uk

OR if not available
Contact:
Mark Huckvale      020 7679 5002   m.huckvale@ucl.ac.uk
Hans Van de Koot   020 7679 3165   h.v.d.koot@ucl.ac.uk

If none of the above are available:-

For Dell server and attached peripherals (bell1, bell2, wave) hardware problems:-

Contact:
Dell Local Government and Education Customer care phone: 01344 373 199
or
Dell technical Support phone: 08709 080 500
quoting the serial number of bell1 or bell2 listed below and printed on the
computer.

Under a Dell Maintenance contract, hardware support is available via phone or
email, plus 4 hour on site replacement of faulty hardware.
The person coming to replace faulty hardware may not be able to help with
related software problems.

Software Problems:-

bell1
-----
Phone support is available from LinuxIT 0117 905 8718 (www.linuxit.com)
Web/email support from Redhat https://www.redhat.com/apps/support/
For web username and password see the attached sheet on the printout of this
document in the backup box.
This gives web support, knowledge base, and email support.

bell2
-----
No software support apart from academic subscription to get updates from Redhat.
Phone support can be obtained from LinuxIT covering bell2 if the problem
would also apply to bell1.

wave
----
Applications server wave is currently covered for software and hardware
support from Dell. Contact phone numbers of Dell above.

Server serial numbers (Called service tags by Dell)
---------------------
bell1: 3CFWX1J
bell2: 4CFWX1J
wave:  BG4PR2J

Server model numbers
--------------------
bell1 and bell2: PowerEdge 1800
wave:            PowerEdge SC440

Operating system used (bell1 bell2 wave)
---------------------------------------
Redhat Enterprise Linux 4 64bit (RHEL 4)

Server Location
---------------
bell1 and bell2: room B18 Basement Wolfson House
wave:            currently G29 Wolfson House.
                 Planned to be in room G33 Wolfson House.

In addition, documentation for the relevant Dell hardware, ie the Dell
Power Edge 1800 servers, is available on www.dell.co.uk under Support and
Help, Servers and Storage, Power Edge. There is also information there on
operating systems relevant to this hardware (RHEL 4 Linux).

In addition, full software documentation for the server operating system
(Redhat Enterprise Linux 4) is available at www.redhat.co.uk under support,
Documentation, Redhat Enterprise Linux, Redhat Enterprise Linux
Administration Guide. The knowledge base area under support can also be
accessed. For username and password see sheets "Information for
Server/network failure" in backup box.

The Dell Maintenance hardware contract covers failure of the systems when
used within the recommended environment. Systems stolen or physically
damaged by the environment are not covered. In these cases, replacements
need to be purchased from Dell.

For all other problems:-

Contact: UCL Information systems (IS) www.ucl.ac.uk/is

General Enquiries to UCL Information systems
----------------------------------------------
Helpdesk 37779

Network Problems
----------------
Network Group:-
Operations:                           37350
Section head: Michael Turpin          37828
Group Manager: Andrew Kerl            37344
Pauline Swindells to locate people:   32359

E-mail
------
Non routine technical problems: Adrian Barker 25140
Routine problems: postmaster@ucl.ac.uk

Restoring Backups
-----------------
Steven Bridge    25149   operating systems group
Paul Hajisavvi           operating systems group

General advice on Unix etc
--------------------------
Contact: UCL IS operating systems group.

Security, hacking etc
---------------------
Marion Rosenberg CERT 32434 / 37388

------------------------------------------------------------------------
------------------------------------------------------------------------

THE INFORMATION GIVEN BELOW MAY BE NEEDED BY DELL OR UCL I.S. IF CONTACTED,
OR ANYONE ATTEMPTING TO DIAGNOSE AND FIX PROBLEMS.

Most of the following information concerns server configurations not
supplied by Dell or Redhat Linux as a standard package and therefore anyone
diagnosing problems needs to be aware of it.

The server system comprising servers bell1 and bell2 in room G33 Wolfson House
------------------------------------------------------------------------------

SERVICES PROVIDED BY THIS SERVER SYSTEM
---------------------------------------
websites:
www.phon.ucl.ac.uk
www.actl.ucl.ac.uk
www.londonling.ucl.ac.uk
www.enhance.phon.ucl.ac.uk/

Networked disks and printing via samba server with samba name bell

Email for addresses @phon.ucl.ac.uk and @ling.ucl.ac.uk used for mailing
lists eg staff_wh@phon.ucl.ac.uk.
All users' email addresses in these domains are now forwarded to these
users' @ucl.ac.uk addresses independently of this server.

ssh and sftp server
anonymous ftp server ftp.phon.ucl.ac.uk

OVERVIEW OF OPERATION
---------------------
Both servers run continuously and each is supplied with mains power from its
own external APC 750 uninterruptible power supply (UPS). If the mains power
fails, the UPS battery will power the server for 5-10 minutes before going
flat. The UPS is there to protect against brief power cuts.

bell1 is the master server which normally provides services such as web,
email, networked disk access, ftp etc. If it or the network connection to it
fails then bell2 takes over providing the services.

Connections are made to the services via the network address 128.40.52.16
whose associated names are bell, mail, mail1, mail2 (.phon.ucl.ac.uk),
www.phon.ucl.ac.uk, www.enhance.phon.ucl.ac.uk and ftp.phon.ucl.ac.uk.
This address and its names are normally assigned to bell1, but are assigned
to bell2 if bell1 fails. If at any time bell1 comes back on line after
having gone down, it will take back the services from bell2, which will go
into standby mode.

Both servers are accessible via ssh as bell1 and bell2 respectively, whether
in standby mode or providing services. No changes should be made to files on
the inactive server (normally bell2) as its files are continually kept in
sync with the active server (normally bell1). Changes should therefore be
made to files only when connected to bell, as this guarantees you are
connected to the active server. The names bell1 and bell2 should be used for
connection only for diagnostic purposes.

WARNING MESSAGES SENT OUT IF SERVICES MIGRATE FROM ONE SERVER TO THE OTHER
--------------------------------------------------------------------------
If for example the normally active server bell1 failed then a message would
be mailed from the standby server bell2 as it took over the services.
This message is mailed to selected people and would read:-

To: ha_warn@phonetics.ucl.ac.uk
Subject: Resource Group Takeover in progress on bell2.phon.ucl.ac.uk

Resource Group Takeover in progress on bell2.phon.ucl.ac.uk
Command line was:
/etc/ha.d/resource.d/MailTo ha_warn@phon.ucl.ac.uk start

If bell1 was restored, bell2 would send out the following message, and bell1
would send a message similar to the one above as it took back services.

To: ha_warn@phonetics.ucl.ac.uk
Subject: Resource Group Migrating resource away from bell2.phon.ucl.ac.uk

Resource Group Migrating resource away from bell2.phon.ucl.ac.uk
Command line was:
/etc/ha.d/resource.d/MailTo ha_warn@phon.ucl.ac.uk stop

The list of email addresses getting these messages is in file /etc/aliases
on server bell1 under mailing list ha_warn.

DIAGNOSING PROBLEMS
-------------------
To determine which server is the active server, log in to
bell.phon.ucl.ac.uk using ssh and at the Unix prompt type the command:
active_server

The message:-
This server, bell1.phon.ucl.ac.uk is currently active for services
should appear.

If the message:-
This server, bell2.phon.ucl.ac.uk is currently active for services
appears, then bell1 or the network connection to it has failed. This should
be fixed as soon as possible because bell2 does not provide nightly backups,
changed files may be out of sync on one server, and no standby server is
available while bell1 is down.

KEEPING THE FILES ON EACH SERVER IN SYNC
----------------------------------------
Email inboxes (now used only for some system mailboxes) and hansiebase data
are very time critical so are synced using DRBD (Distributed Replicated
Block Device). This is not part of RedHat Linux and there is no maintenance
contract covering it. For details see www.drbd.org.
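
As a rough illustration only, the state of the DRBD devices can usually be
read from /proc/drbd, which DRBD provides on the servers it runs on. The
sketch below assumes each device line in that file carries a cs: (connection
state) field, with cs:Connected meaning the two servers are replicating. The
function name drbd_unconnected is hypothetical and is not one of the local
scripts described in this document.

```shell
#!/bin/sh
# Hypothetical helper, not part of the bell installation.
# Prints every device line in a DRBD status file whose connection state
# (the cs: field) is anything other than Connected.
# Assumption: device lines look like " 0: cs:Connected st:Primary/Secondary ..."
drbd_unconnected() {
    status_file="${1:-/proc/drbd}"   # default to the live status file
    grep 'cs:' "$status_file" | grep -v 'cs:Connected'
}
```

Typed as drbd_unconnected on either server, no output would suggest both
devices are connected. Any line printed names a device whose link to the
other server needs looking at.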

The following are mounted on the active server

/dev/drbd0   5281400   313128   4699984   7%   /share1   (email)
/dev/drbd1   5281400    57656   4955456   2%   /share2   (hansiebase)

Unmounted disk devices /dev/drbd0 and /dev/drbd1 on the inactive server are
kept in sync almost instantaneously via a dedicated high speed ethernet
cable between bell1 and bell2. This can be thought of as network RAID.

All other files and directories which need to be kept in sync are updated
every 10 minutes by a script using the rsync command and executed by the
cron system. These are:
/web /home /btemp /backup /bdata /share3
and files in /etc:
passwd shadow group gshadow aliases,
web configuration files in /etc/httpd/conf,
samba configuration files and samba password file in /etc/samba
and all of /usr/local.

Files which are excluded from syncing are listed in /usr/local/etc/ha_sync.

SYNCING FILES BETWEEN SERVERS BEFORE SHUTTING DOWN THE ACTIVE SERVER.
--------------------------------------------------------------------
This is not yet automated so needs to be done manually.

The active server should be shut down as root with the command:-

ha_mirror_all -replicate; init 0

This ensures that any files changed since the last sync are updated on the
other server, which will automatically take over services.

********CAUTION! WHEN BOOTING BELL1**********
---------------------------------------------
First log in to bell2 as root
Type: date
to display the exact time

Ensure that:-
The command to boot or reboot bell1 is given as close as possible to 1
minute past the hour or 1 minute past any multiple of 10 minutes past the
hour, eg 9:11am, 9:21am, 9:31am, 9:41am, 9:51am etc.

This ensures that when bell1 has completed booting up there is time for the
action described below to be taken before bell1 updates files on bell2.
Otherwise file changes or new files created on bell2 when it was the active
server will be deleted or overwritten with the older versions by bell1.

Then take this action:-

Make bell1 boot up.
eg:
If it is hung and nothing can be typed:-
Try pressing the reset button on the front panel.
If this does nothing then switch the output of its UPS off then on so the
mains power to bell1 is interrupted.
or
If there is a unix root prompt
type: init 6
for a reboot

When bell1 has booted and the login prompt appears (this can take a few
minutes, with a pause of up to 40 seconds after "nash" is displayed),
immediately log in to bell1 as root.
type: service heartbeat stop
(this makes bell2 the active server)

On bell2 at root prompt
type: ha_mirror_all -replicate
(this updates any files on bell1 changed or created on bell2 when it was the
active server)

When the Unix prompt reappears on bell2, then on bell1 immediately
type: service heartbeat start
(this makes bell1 the active server)

MAKING A SERVER THE ACTIVE SERVER
---------------------------------
bell1 is normally the active server and the server system will default to
bell1 as the active server if it and its network connection are operating
correctly. If it fails then bell2 becomes active until bell1 comes on line
again. Which server is active is controlled by the heartbeat program. This
is not part of RedHat linux so is not covered by the maintenance contract.
For information see www.linux-ha.org.

The standby, normally inactive server bell2 can be made active by stopping
heartbeat running on bell1. Files should be synced between the servers first.

To make bell2 active instead of bell1:-
On bell1 as root type:-
ha_mirror_all -replicate; service heartbeat stop

To make bell1 active again:-
on bell2 as root type:-
ha_mirror_all -replicate
then immediately on bell1 type:-
service heartbeat start

CURRENT ISSUES WITH SERVERS BELL1 AND BELL2
-------------------------------------------

1. Network Printing
The print server for a particular printer sometimes disables itself if the
printer hangs for a long time.
To see if this has happened:-
Login to bell as root
type: lpc
at the lpc prompt type: status
scroll through the list of printers to see if any are labelled as disabled.
type: quit
For any disabled printers type:- cupsenable printer_name

This problem is often caused by a print job in the print queue which hangs
the printer. If this is suspected remove this print job from the queue:-
as root on bell
type: lpq -Pprinter_name
note down the job number
type: lprm -Pprinter_name job_number
switch printer off then on
type: cupsenable printer_name
if it has again become disabled

To remove all jobs from the queue
type: lprm -Pprinter_name -

2. Disk quotas
If a server reboots or becomes active after being inactive then quotas for
email and hansiebase are not automatically enabled. This causes error
messages if anyone at the unix prompt types quota or show_quota to see their
disk quotas. These quotas are not re-enabled until early the following
morning.
To fix this, at root prompt type:-
quotaon -vu /share1
quotaon -vu /share2

Disk devices are not mapped to disk partition names in output from the quota
command. They are:-
/dev/mapper/VolGroup01-LogVol01   /home     (formerly users)
/dev/mapper/VolGroup01-LogVol05   /backup   (for PCs)
/dev/mapper/VolGroup01-LogVol11   /web      (all web servers)
/dev/drbd0                        /share1   (mail inboxes)
/dev/drbd1                        /share2   (hansiebase)
They are mapped in the show_quota command.

3. Occasional Freeze up of bell1.
This may be caused by a kernel panic, a hardware problem or disk over
temperature. If a kernel panic occurs on bell1, the servers are configured
to dump it to bell2 into /var/crash. bell1 has been configured to reboot
itself if a kernel panic occurs. Dell have examined the software
installation and can find no fault with it. The system board and power
supply have been replaced. The cause has still not been identified.
If bell1 shuts itself down, Dell technical support should be contacted on
08709 080 500 and case ID number: 514641046 should be quoted.

4. Disk drive Temperature
Because of inadequate air flow in the server room, disk drive temperatures
on hot days may rise to undesirably high levels. The maximum operating
temperature for the five internal disks is 55 degrees C. There is a thermal
cutout on the disks which trips at 65 degrees C.
Disk drive temperatures for bell1 only are recorded every 15 minutes and can
be viewed at
http://www.phon.ucl.ac.uk/home/dept/computing/drive_temp.txt
An email is sent to selected addresses every 20 minutes if the temperature
of any disk is recorded as over 54 degrees C. All doors to the server room
should be opened to increase air flow if this temperature is reached.
The addresses this is sent to are in file /etc/aliases on bell1 against
alias drive_temp_warn. Addresses can be removed or added. When the aliases
file is saved, the command newaliases must be typed as root for the saved
file to become active.

BELL and WAVE SERVER BACKUPS, DISK INFORMATION AND BOOT CD
-------------------------------------------------
These are in fireproof box G35 Wolfson House.

Restoring Backups
-----------------
For bell servers:-
RESTORE THE FOLLOWING OR ANYTHING UNDER THEM TO THE ACTIVE SERVER ONLY
(normally bell1). They will be automatically copied to the inactive server
(normally bell2):-
/web /home /btemp /backup /bdata /share /share2 /share3
/etc/passwd /etc/shadow /etc/group /etc/gshadow
web_conf smb_conf /usr/local /etc/aliases /etc/httpd.conf

For bell1 bell2 and wave servers:-
All disk partitions (except /tmp) are backed up to the UCL IS backup server
nightly. To restore from this backup the dsmc command is used when logged in
as root (su). This can only be used if the system will boot from disk to
multi user mode, there is a network connection to the backup server and the
tivoli backup software is installed in /opt/tivoli.

A backup is made every few months to local DAT tape for bell1, bell2 and
wave of the system partitions /(root) /usr /var /boot.
These backups are kept in the fireproof backup box in room G35 Wolfson
House. To restore from this backup the restore command needs to be used with
the DAT tape with the latest backups in the DAT drive on the system. This
restore needs to be used if system software is damaged so preventing the
dsmc command from working.

In addition, a copy of /(root) /usr /var for bell1 and bell2 is kept on
normally unmounted partitions. See the section below headed:-
"Restoring root and system files if won't boot from boot disk"

Use of the dsmc command to restore backups from UCL backup server
-----------------------------------------------------------------
Log in as root

To restore a file to the place on disk it was backed up from, eg the file
/etc/aliases
    dsmc restore /etc/aliases -latest

To restore /etc/aliases to /btemp
    dsmc restore /etc/aliases /btemp/aliases -latest

To restore the whole of the directory /web/www_phon/dept including any sub
directories to /web/www_phon/dept.
    dsmc restore /web/www_phon/dept/"*" -subdir=y -latest

An interactive method is also available to select files and directories for
restore on screen using the -pick option.
eg to select parts of /etc for restoring
    dsmc restore /etc/"*" -subdir=y -inactive -pick
A list will appear and a menu allowing files/directories to be selected then
restored. Files/directories which have been deleted from local disk are
labelled I, versions on disk are labelled A.

For more help with the dsmc command, type dsmc then ? at the tsm prompt.

To restore any system files and/or directories from DAT tape:-
--------------------------------------------------------------
Only bell1 and bell2 have a DAT tape drive. To restore to wave, its DAT
backup tape is put in the drive on bell2 and accessed from the restore
command on wave. See following section "Restoring from DAT tape to wave"

The following assumes that the system is already booted from an attached disk.
If booted from the Installation CDROM, the disk partition where files are to
be restored must first be mounted on /mnt. Then cd to
/mnt/restore_directory_name (see below under 'Failure of Root Filesystem')

Place DAT 72 tape in bell DAT drive.
At SU prompt, cd to the directory on disk where the top of the tree to be
restored should appear.

eg to restore data on tape containing /usr backup:
Ensure that /usr is mounted on the disk partition
/dev/mapper/VolGroup00-LogVol02 by typing df.
Type: cd /usr
at Unix prompt.
Type: restore -ivf /dev/st0
Type ? at restore prompt to see restore options.
Type ls to see directories available for restore. All directories directly
under /usr should be listed.
Type add directory_name to mark those selected for restore, or type just add
to mark them all. This will make directories for restoring into.
Type ls to confirm a * has been added against each marked entry.
Type extract to restore them.
Type 1 if asked which volume to extract.
After the restore type y if prompted with "set owner/mode for ." and y again
if it says they exist, set anyway.

Where a tape contains multiple dumps and a dump other than the first needs
to be extracted:-
for example to extract the second dump on the tape
    restore -ivf /dev/st0 -s 2
type: what
at restore prompt to confirm it displays the required dump.
Then proceed as above to restore the dump.
Or:-
First wind the tape to the beginning of the required dump by typing at the
bell Unix prompt:-
eg to wind to the second dump on the tape
    mt -f /dev/nst0 fsf 1
then
    restore -ivf /dev/st0

Restoring from DAT tape to wave
-------------------------------
As for above except:-
Place wave backup DAT tape in the bell2 DAT drive
On bell2 as root login, edit the file /etc/ssh/sshd_config to change
PermitRootLogin no
to read
#PermitRootLogin no
At root prompt type the command: service sshd restart

The restore command is preceded by RSH=ssh (the RSH environment variable
tells restore which remote shell program to use) and bell2:/dev/st0 is used
instead of /dev/st0 in the restore command
for example
    RSH=ssh restore -ivf bell2:/dev/st0
to restore the first dump on the tape into the current directory.

The mt command is preceded by ssh bell2
for example
    ssh bell2 mt -f /dev/nst0 fsf 1
to wind the tape forward to the next end of dump marker.

WHEN FINISHED RESTORING CHANGE
#PermitRootLogin no
BACK TO
PermitRootLogin no
IN /etc/ssh/sshd_config AND THEN AT ROOT PROMPT TYPE THE COMMAND:
service sshd restart

booting from boot cd
---------------------
This is needed if bell1 or bell2 or wave won't boot from internal disk.

The bootable CD for bell is CD number 1 64 bit in the Redhat Enterprise
Linux CD box.
For wave use the separate CD1 RHEL-U4-X86_64_ES as the above CD does not
contain drivers to see the internal disks on wave.

Put CD in drive
Press power button on server
When menu appears press F5 (rescue mode)
Type: linux rescue
Select language English, keyboard UK
Answer no to "Do you want to start the network interfaces"
Select skip for next option
This should result in a Unix prompt with limited commands available.
type: fdisk -l
to see partitions on hard disks.
To activate logical volumes so they can be seen on hard disks type:-
    lvm vgscan
    lvm vgchange -ay
to list logical volumes on internal disks type:-
    lvm lvscan
Any lvm commands can now be used if they are preceded by lvm.
For more information on lvm type at root prompt on a Redhat linux system
    lvm help
or
    man lvm

Restoring root and system files if won't boot from boot disk
------------------------------------------------------------
Boot from CD as above
type: fdisk -l
to see partitions on hard disk.
The disk arrangement uses logical volumes and volume groups on the
partitioned disk as viewed by fdisk -l, lvm lvscan, lvm vgscan, lvm pvscan
etc.
Volume group 0 contains the Operating system as installed by Dell.

Put the backup DAT tape from the fireproof box with the latest date and for
that server into the DAT drive slot on the front of the server.

Ask Dell/Redhat for advice on restoring the system from a backup.

On the backup DAT there are backups sequentially along the tape in dump
format of:
root (/), /usr, /var, /boot

Each logical volume for the above needs to be mounted onto a directory
created under the booted CD with the mkdir command. Then the restore command
is used to restore from DAT. See "To restore any system files and/or
directories from DAT tape" above.

If a disk needs to be re-formatted and re-partitioned, the information
needed is on hardcopy sheets in fireproof box G35 Wolfson House. The steps
for this are:-
create partitions using the fdisk command
create physical volumes using pvcreate commands
create volume groups using vgcreate commands
create logical volumes using lvcreate commands

In addition, a backup of the logical volume configuration is automatically
kept under /etc/lvm. This can be restored from nightly backups from the
other bell server, or a recent copy is kept in tar format on CD in the
fireproof box in with the redhat linux CDs. It can be used to automatically
create the logical volumes. Ask Dell/Redhat for advice.
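
The physical volume and logical volume steps above can be sketched as a dry
run that only prints the commands it would issue. Everything specific here
is an assumption for illustration: the function name print_lvm_rebuild is
hypothetical, and the device /dev/sda2, the names VolGroup00 and LogVol00
and the size 4G are placeholders. The real values must come from the
hardcopy sheets or the /etc/lvm backup. A vgcreate line is printed because
LVM needs a volume group between the physical and logical volumes. Each
command carries the lvm prefix required when booted from the rescue CD.

```shell
#!/bin/sh
# Dry-run sketch: prints the LVM commands for rebuilding one logical volume.
# Nothing is executed; review the output before typing any of it.
# All device, group and volume names passed in are placeholders.
print_lvm_rebuild() {
    pv_device="$1"   # partition already created with fdisk
    vg_name="$2"     # volume group name
    lv_name="$3"     # logical volume name
    lv_size="$4"     # logical volume size, eg 4G
    echo "lvm pvcreate $pv_device"
    echo "lvm vgcreate $vg_name $pv_device"
    echo "lvm lvcreate -L $lv_size -n $lv_name $vg_name"
}

# Example call with placeholder values
print_lvm_rebuild /dev/sda2 VolGroup00 LogVol00 4G
```

Printing rather than running the commands keeps the sketch safe to try on a
working system. The printed lines can be checked against the hardcopy sheets
before anything is typed at the rescue prompt.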

When the system is restored from DAT, other filesystems can if necessary be
restored from the UCL central backup server using the dsmc command. See
section "Use of the dsmc command to restore backups from UCL backup server"
above.

Also, any files changed since the last DAT backup will need to be restored
using the dsmc command. Files in /etc such as /etc/passwd /etc/shadow
/etc/samba/password may have changed. Those needing to be restored can be
determined with a command such as:-
    find /etc -mtime -n -ls
where n is the number of days since the date of the DAT backup or the last
backup to the UCL IS backup server.

Nightly Backups to the Central UCL system are performed from bell1 and bell2
and wave.

For bell servers, non system files (all except those in /, /usr, /var)
should normally be restored only to bell1 from its backup as these are
automatically mirrored to bell2 at 10 minute intervals. System files should
only be restored from the backup of the matching server (bell1 or bell2) as
some of these are different on each server.

wave has its own set of files except for /home which is mounted from bell.

In addition, a copy of /(root) /usr /var for bell1 and bell2 is kept
unmounted on the following partitions
/dev/mapper/VolGroup01-LogVol06   /
/dev/mapper/VolGroup01-LogVol07   swap
/dev/mapper/VolGroup01-LogVol08   /usr
/dev/mapper/VolGroup01-LogVol09   /tmp
/dev/mapper/VolGroup01-LogVol10   /var
and for wave on partitions
/dev/mapper/VolGroup01-LogVol00   /
/dev/mapper/VolGroup01-LogVol01   swap
/dev/mapper/VolGroup01-LogVol02   /usr
/dev/mapper/VolGroup01-LogVol03   tmp
/dev/mapper/VolGroup01-LogVol04   /var
These may be more up to date than the DAT backups.
The date of the last backup is shown by the date of the file
/root/disk_backup/copy_system_log

DEVICES, CONNECTIONS NETWORK CONFIGURATIONS OF BELL1 and BELL2
--------------------------------------------------------------

Connections between servers
--------------------------
Ethernet
Device   hardware   use               address
eth0     card       rsync heartbeat   b1: 10.0.2.1       b2: 10.0.2.2
eth1     card       DRBD heartbeat    b1: 10.0.1.1       b2: 10.0.1.2
eth2     on board   internet          b1: 128.40.52.17   b2: 128.40.52.18
eth2:0   on board   internet          128.40.52.16 on active server

type ifconfig at su prompt to see these

serial
/dev/ttyS0   on board   heartbeat

Internet connection cables
bell1 red, bell2 blue, plugged into the built in on board RJ45 ethernet
sockets on the servers and connected to wall network sockets.

Data cables between bell1 and bell2
Crossover ethernet cables plugged into ethernet card sockets.
Corresponding RJ45 sockets on each card must connect between servers.
Corresponding serial cable sockets must connect by serial cable.

DO NOT CONNECT BOTH SERVERS TO THE INTERNET WITHOUT THE INTERCONNECTING
CABLES ABOVE CONNECTED.
This would cause them both to attempt to become the active server.

Each server has identical hardware: 5 x 73 GB scsi disk drives, 1 CD Rom
drive, 1 DAT tape drive, USB mouse and keyboard, CRT monitor, UPS power
supply.

DEVICES, CONNECTIONS NETWORK CONFIGURATIONS WAVE
------------------------------------------------
Ethernet
Device   hardware   use        address
eth0     on board   internet   128.40.52.19

2 SATA 150GB disk drives
1 CD/DVD Read only drive
USB mouse and keyboard
17 inch flat panel monitor
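
The shared service address 128.40.52.16, listed above against eth2:0 on the
bell servers, gives a second way to confirm which bell server is active
alongside the active_server command. The sketch below is illustrative only:
the function name holds_service_addr is hypothetical and is not one of the
installed local scripts. It reads ifconfig-style output on standard input
and reports whether that address appears in it.

```shell
#!/bin/sh
# Hypothetical helper: reads ifconfig-style output on standard input and
# reports whether the shared service address is configured on this host.
SERVICE_ADDR="128.40.52.16"   # shared address from the bell1/bell2 table

holds_service_addr() {
    # -F: match a fixed string; -w: whole-word match, so an address such
    # as 128.40.52.161 would not be counted
    if grep -qwF "$SERVICE_ADDR"; then
        echo "service address present: this server is active"
    else
        echo "service address absent: this server is on standby"
    fi
}
```

At a root prompt on bell1 or bell2 it would be used as:
/sbin/ifconfig | holds_service_addr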