App Layering: "Welcome to Emergency Mode" usually means the Repository logical volume is damaged

Article ID: CTX234956

Description

After a reboot, the ELM refuses to boot normally and instead reports that you are now in Emergency Mode:

Emergency mode

Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to
try again to boot into default mode
Give root password for maintenance
(or type Control-D to continue):


 

Resolution

Shut down the ELM and take a snapshot now.  Then power back on and log in as root, using your normal "root" password.

This error means there is a fatal problem in the Layering Service layer repository store.  Be aware that it may not be possible to recover from this.  However, your recovery efforts need to be focused on that area.

Your first instinct might be that the boot disk partitions need an fsck, like CTX221751.  In reality, this is not true.  App Layering uses XFS as the filesystem for both the boot partitions and the repository store, and when you attempt to fsck an XFS filesystem, fsck returns success without doing anything.  XFS is a self-repairing, journaled filesystem that should never need this kind of repair. 
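
If you want to see this for yourself, fsck's dry-run option shows that it would simply hand the device off to the no-op fsck.xfs helper (shown here for /dev/sda1; substitute your own partition if it differs):

# fsck -N /dev/sda1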

Although there is a tool called "xfs_repair", it cannot be run on a mounted filesystem.  So if you really believe that you need to run xfs_repair on /dev/sda1 or /dev/sda2 (the boot and root partitions on the boot disk), you will need to boot up another Linux machine and manually attach the boot disk from your ELM to that machine.  That is beyond the scope of this article, and has never yet been necessary in App Layering, so we will not go into details here.

The Layering Service layer repository is a "logical volume" built using the Linux Logical Volume Manager (LVM) tools.  This is how we allow you to expand the layer repository: we simply take any extra space or blank disks you provide, initialize them for use in the LVM, expand the volume group (VG), and expand the Logical Volume (LV) itself.  Your VG could be composed of multiple Physical Volumes (PV) with your data spanned across the disks.  If your VG is damaged in a way that LVM cannot recover from, you may not be able to get access to the data. 
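
For reference, the expansion the appliance performs amounts to roughly the following sequence (a sketch only, using /dev/xvdc as an example device name; the actual automation may differ in detail):

# pvcreate /dev/xvdc
# vgextend unidesk_vg /dev/xvdc
# lvextend -l +100%FREE /dev/unidesk_vg/xfs_lv
# xfs_growfs /mnt/repository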

General troubleshooting guidance for the ELM's LVM can also be found in the article App Layering: How to troubleshoot the layer repository disk.

Having SAN-level snapshots, backups, or even clones of the ELM can help guard against complete data loss in situations like this.

Getting the complete history of storage operations is critical at this point.  You need to know when the repository was expanded and how, in order to determine what options you have going forward.  It's not possible to lay out all the ways you can have LVM problems, so instead we will walk through one common scenario: a disk added to the LVM is deleted.

Imagine you start with the initial 300GB disk.  You then expand it with a 3000GB disk.  Then you decide you didn't want 3000GB, so you delete the disk, and add a 300GB disk and expand into that.  You think you now have a 600GB volume.  The ELM thinks you have a 3600GB volume.  The additional expansion could succeed as long as LVM never tried to access data in the 3000GB gap in the middle.

This can get a lot more complicated, too, because you can also expand your original disk at any time.  So you could start with 300GB, add a 200GB disk, expand the initial disk to 400GB, add another 200GB disk, and expand the first 200GB disk to 300GB.  From the user perspective, you have a single 900GB volume, but in LVM, there are 5 separate segments spread across three disks in chronological order.  While we could probably recover from deleting the third disk with the fourth segment, we probably cannot recover from deleting the second disk with the second and fifth segments.
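
To see how the logical volume's segments are actually laid out across the physical volumes in your own ELM (useful for judging which segments a deleted disk took with it), ask LVM for the segment map:

# lvs --segments -o +devices unidesk_vg
# lvdisplay --maps /dev/unidesk_vg/xfs_lv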

In all cases, your only hope for full recovery is if the missing disk has not had any data written to it.  If you add a disk and immediately delete it, then you have some pretty good hope for recovery.  If you add a disk, use it for a month, and then delete it, you are very likely to have corrupted or completely missing layer files.

In the ELM, there is only ever one VG, named unidesk_vg, into which all PVs are concatenated.  That VG contains one LV, called xfs_lv, and accessible as /dev/unidesk_vg/xfs_lv.  This is true no matter what platform the ELM is based on.  The disk device names may change (/dev/sdb versus /dev/xvdb), but the VG and LV names are consistent.
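
A quick way to see which block devices currently back that volume group, regardless of how your platform names them, is lsblk:

# lsblk -o NAME,SIZE,TYPE,MOUNTPOINT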

Note: LVM stores its configuration in /etc/lvm.  The current configuration can be found in /etc/lvm/backup/unidesk_vg, and previous copies of the configuration (copies are made before each LV operation - see the "description" line) are stored in /etc/lvm/archive.  While reading those files is well beyond the scope of this article, it's possible to piece together the history of LVM operations in the ELM by reading through the archive files in chronological order.

/etc/lvm/archive:
total 16
-rw-------. 1 root root  913 Feb  6 12:46 unidesk_vg_00000-1214181995.vg
-rw-------. 1 root root  924 Feb  6 12:46 unidesk_vg_00001-58662273.vg
-rw-------  1 root root 1360 May  4 11:01 unidesk_vg_00002-1911836168.vg
-rw-------  1 root root 1612 May  4 11:01 unidesk_vg_00003-1711355933.vg

/etc/lvm/backup:
total 4
-rw------- 1 root root 1789 May  4 11:01 unidesk_vg
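
The "description" line near the top of each archive file records the command that caused LVM to save that copy, so a quick way to reconstruct the order of operations is (the exact wording of each description will vary):

# grep -H description /etc/lvm/archive/*.vg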


There are three basic tools for Linux LVM: pvdisplay (show the physical volumes in your LVM), vgdisplay (show your "volume groups" built up from your PVs), and lvdisplay (show the "logical volumes" carved out of your VGs).  Use those to determine the UUID of the missing disk.  In this example, we're going to simply give LVM back the disk it's missing, using the UUID that the old disk had.  This will allow LVM to bring the VG back up, but will leave you with a hole in the middle of your VG.

First, run pvdisplay to see your present and missing PVs.

  WARNING: Device for PV w3F3ad-tmK8-DfPL-eWlN-aeNg-KLbW-ZkLr05 not found or rejected by a filter.
  --- Physical volume ---
  PV Name               /dev/xvdb
  VG Name               unidesk_vg
  PV Size               300.00 GiB / not usable 4.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              76799
  Free PE               0
  Allocated PE          76799
  PV UUID               KzGOla-iLmf-Mog0-YYOd-9EWn-S1ug-fjW0nx
   
  --- Physical volume ---
  PV Name               [unknown]
  VG Name               unidesk_vg
  PV Size               100.00 GiB / not usable 4.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              25599
  Free PE               0
  Allocated PE          25599
  PV UUID               w3F3ad-tmK8-DfPL-eWlN-aeNg-KLbW-ZkLr05


Then run vgdisplay and lvdisplay to ensure that the LV and VG size agree and match the sum of the PVs listed.  We only care about the deleted PV disk, but now is your best opportunity to see if you have more serious problems.

  WARNING: Device for PV w3F3ad-tmK8-DfPL-eWlN-aeNg-KLbW-ZkLr05 not found or rejected by a filter.
  --- Logical volume ---
  LV Path                /dev/unidesk_vg/xfs_lv
  LV Name                xfs_lv
  VG Name                unidesk_vg
  LV UUID                Iechln-zjD7-W2gf-55mH-d1Ou-aqDa-QhZWQd
  LV Write Access        read/write
  LV Creation host, time localhost.localdomain, 2018-02-06 12:46:11 -0500
  LV Status              NOT available
  LV Size                399.99 GiB
  Current LE             102398
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
   
  WARNING: Device for PV w3F3ad-tmK8-DfPL-eWlN-aeNg-KLbW-ZkLr05 not found or rejected by a filter.

  --- Volume group ---
  VG Name               unidesk_vg
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                1
  VG Size               399.99 GiB
  PE Size               4.00 MiB
  Total PE              102398
  Alloc PE / Size       102398 / 399.99 GiB
  Free  PE / Size       0 / 0   
  VG UUID               GU8esp-euNA-qMDO-UH9Z-V0LB-Xzvs-5YsUPG


The three important pieces of information to determine are the UUID of the missing PV, its size, and that its PV Name really is "unknown".  The PV Name normally tells you the device that the PV is currently found on.  The device can change; PVs are known by their UUIDs, not their physical location.  But it's important to make sure that the PV you're about to create is listed as being attached to "unknown".

Now attach a new, correctly-sized disk to your virtual machine.  If for some reason you can't be sure of the exact size, overestimate.  Extra space at the end goes unused, but coming up short is likely a disaster.  Always remember that you have a snapshot.

Then get Linux to recognize the new SCSI disk you just attached by rebooting.  Since there are no other processes running, a reboot is the safest way to get the new, blank disk available.  Depending on your hypervisor, you may need to power-off and power back on to get the disk recognized.
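
If rebooting is somehow not an option, a generic Linux alternative (not specific to App Layering, and only applicable when the hypervisor presents the disk as a hot-added SCSI device) is to rescan the SCSI bus:

# for host in /sys/class/scsi_host/host*; do echo "- - -" > "$host/scan"; done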

Use "fdisk -l" to identify the device path for the new, empty disk.  Note that every PV disk has no partition, so you need to make some intelligent guesses about which disk is being used.  Run "pvdisplay" again to make sure you're not considering any disks that are already in use.  In this case, /dev/xvdc is not used in pvdisplay.

Disk /dev/xvdc: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/xvda: 32.2 GB, 32214351872 bytes, 62918656 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000b3644

    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *        2048     1026047      512000   83  Linux
/dev/xvda2         1026048    41986047    20480000   83  Linux
/dev/xvda3        41986048    58763263     8388608   82  Linux swap / Solaris

Disk /dev/xvdb: 322.1 GB, 322122547200 bytes, 629145600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Then run this command (substituting in the correct disk instead of /dev/xvdc) to create the new PV with the old UUID. 

pvcreate --uuid=w3F3ad-tmK8-DfPL-eWlN-aeNg-KLbW-ZkLr05 /dev/xvdc --restorefile /etc/lvm/backup/unidesk_vg

Success looks like this:

  Couldn't find device with uuid w3F3ad-tmK8-DfPL-eWlN-aeNg-KLbW-ZkLr05.
  WARNING: Device for PV w3F3ad-tmK8-DfPL-eWlN-aeNg-KLbW-ZkLr05 not found or rejected by a filter.
  Physical volume "/dev/xvdc" successfully created.


If you see a "Can't find uuid" error like this, then you have mistyped the UUID.  Double-check the ID and re-enter the command.  (See if you can figure out where I substituted a capital i for a lower-case L.)

  Couldn't find device with uuid w3F3ad-tmK8-DfPL-eWlN-aeNg-KLbW-ZkLr05.
  Can't find uuid w3F3ad-tmK8-DfPL-eWIN-aeNg-KLbW-ZkLr05 in backup file /etc/lvm/backup/unidesk_vg
  Run `pvcreate --help' for more information.


Once you have confirmation that the PV is created, reboot.  The LVM should start up, and the system will be functional, including the management console.  However, your layer disk data may be corrupted.  Make an immediate backup of the system and then test to see how bad the damage might be.
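
After that reboot, it's worth confirming that LVM now sees both PVs, that the LV is active, and that the repository is mounted again before going further:

# pvdisplay
# lvdisplay /dev/unidesk_vg/xfs_lv
# df -h /mnt/repository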

Log in as root again, and perform the following to attempt to repair the XFS filesystem.  This may or may not actually fix the problem with the large void in the middle of the volume, but it is important to get this done before you start using the repository.

# umount /mnt/repository
# xfs_repair /dev/unidesk_vg/xfs_lv


xfs_repair may produce a lot of output.  I have no guidance for interpreting that output.  Reboot after the repair.

Note: For an ELM appliance in Azure, log in with the CitrixAdmin account, prefix commands with sudo, and supply the CitrixAdmin account password when prompted.

- You can reset the CitrixAdmin account password from Azure in case it has been forgotten (go to the ELM appliance in Azure -> Reset Password).
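
For example, the repair step above would look like this when run from the CitrixAdmin account:

$ sudo umount /mnt/repository
$ sudo xfs_repair /dev/unidesk_vg/xfs_lv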