Tuesday, January 21, 2020

Fun with grub2-install "not a directory"

Recently, I came across spontaneous kiwi image build failures in a private OBS instance.
The images were SLES15-SP1 and had not been touched for quite some time; rebuilds were only triggered by new packages in the update channel.
The failure was grub2-install aborting with the error message "not a directory".
Looking at the recent changes in the update repo showed no obvious reason (some python packages that had nothing to do with grub2-install), so I started to investigate...

... 3 days later, after following some detours, I finally found the issue.

grub2-install scans the installation device for filesystems, and probes all known (to grub2) fs types. The probe of "minix_be" fails fatally. Sometimes.

After building my own grub2 package with lots of debug printfs, I finally found out that the minix fs detection of grub2 is a little "fragile". It does the following (pseudo code):
  • grub_mount_minix(device) || return "not minix fs"
  • grub_minix_find_file("/") || fatal_error()
The problem is that grub_mount_minix() only does a pretty simple magic number check, which can lead to false positives.

Comparing the superblock structures of ext[234] and minix filesystems (from the grub2 source code) side by side, you see this:

struct grub_minix_sblock         |struct grub_ext2_sblock
{                                |{
  grub_uint16_t inode_cnt;       |  grub_uint32_t total_inodes;
  grub_uint16_t zone_cnt;        |
  grub_uint16_t inode_bmap_size; |  grub_uint32_t total_blocks;
  grub_uint16_t zone_bmap_size;  |
  grub_uint16_t first_data_zone; |  grub_uint32_t reserved_blocks;
  grub_uint16_t log2_zone_size;  |
  grub_uint32_t max_file_size;   |  grub_uint32_t free_blocks;
  grub_uint16_t magic;           |  grub_uint32_t free_inodes;
};                               |};
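
As a quick sanity check of the side-by-side view above, the byte offsets of the two fields on the last row can be computed from the field sizes. This is only an illustration using Python's struct module, not anything grub2 does itself; the format strings are transcribed from the two structs shown above and assume no padding between fields.

```python
import struct

# Field sizes taken from the structs above.
# H = 16-bit unsigned, I = 32-bit unsigned; "<" means no padding.
MINIX_FIELDS_BEFORE_MAGIC = "<HHHHHHI"     # inode_cnt .. max_file_size
EXT2_FIELDS_BEFORE_FREE_INODES = "<IIII"   # total_inodes .. free_blocks

minix_magic_offset = struct.calcsize(MINIX_FIELDS_BEFORE_MAGIC)
ext2_free_inodes_offset = struct.calcsize(EXT2_FIELDS_BEFORE_FREE_INODES)

print(minix_magic_offset, ext2_free_inodes_offset)  # prints "16 16"
```

Both fields start 16 bytes into their superblock, so the first two bytes of ext2's free_inodes are exactly what a minix probe reads as its magic.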

This already hints at the issue: at the same disk location where ext2 stores the free inodes number, minix stores its magic number, which is used by grub to detect if it is a minix file system.

Now if you happen to have an ext3 file system with a free_inodes number whose lower 16 bits resemble one of the GRUB_MINIX_MAGIC numbers, chances are grub_mount_minix() will succeed, but then the attempt to access the root directory will fail with a fatal error.
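
A minimal sketch of that collision check follows. It rests on two assumptions that should be verified against the actual grub2 source: the magic values in ASSUMED_MINIX_MAGICS are quoted from memory for the minix v1/v2 layouts, and the "minix_be" probe is assumed to read the same two bytes in big-endian order. The function name looks_like_minix is made up for this example.

```python
import struct

# Assumed magic values (from memory of grub's minix.c, v1/v2 layouts only);
# check them against your grub2 source before relying on this.
ASSUMED_MINIX_MAGICS = {0x137F, 0x138F, 0x2468, 0x2478}


def looks_like_minix(free_inodes: int) -> bool:
    """Return True if an ext2 free_inodes value could pass a minix magic check."""
    # ext2 stores the 32-bit counter little-endian, so the first two bytes
    # on disk are its lower 16 bits.
    first_two_bytes = struct.pack("<I", free_inodes)[:2]
    as_little_endian = struct.unpack("<H", first_two_bytes)[0]  # "minix" reading
    as_big_endian = struct.unpack(">H", first_two_bytes)[0]     # assumed "minix_be" reading
    return (as_little_endian in ASSUMED_MINIX_MAGICS
            or as_big_endian in ASSUMED_MINIX_MAGICS)
```

With these assumed values, only a handful of the 65536 possible low-16-bit patterns collide, which would fit the "Sometimes" above: most builds are fine, and any change to the number of files in the image moves the counter away from (or onto) a colliding value.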

This is a plain grub2 bug, which I will probably report upstream and try to get fixed.
However, I need a fix to have my images build again, and the chances of getting a fix into SLES15-SP1 are ... low (and it is a daunting task, even if you are reporting this bug as a big SLES customer), so I built a workaround in my (locally built, lucky me...) python-kiwi package.

It basically does the following before calling "chroot grub2-install ...":

  • statvfs() to get the free_inodes number
  • check if the lower 16 bits resemble one of the MINIX_MAGIC numbers
    • if it does, touch a temporary file in
    • unmount and mount again to update the superblock (I missed this at first and wondered why it did not work)
    • unlink the temporary file
  • continue as before
This workaround is ugly as hell, but it does work for me.
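
The steps above could look roughly like the following sketch. It is not the actual python-kiwi patch: the function names, the remount callback and the magic values are assumptions made for illustration, and the temporary file is simply created at the top of the mounted filesystem.

```python
import os
import tempfile
from typing import Callable

# Assumed minix v1/v2 magic values; verify against the grub2 source.
ASSUMED_MINIX_MAGICS = {0x137F, 0x138F, 0x2468, 0x2478}


def collides_with_minix(free_inodes: int) -> bool:
    """Check the lower 16 bits, both as-is and byte-swapped (assumed minix_be)."""
    low16 = free_inodes & 0xFFFF
    swapped = ((low16 & 0xFF) << 8) | (low16 >> 8)
    return low16 in ASSUMED_MINIX_MAGICS or swapped in ASSUMED_MINIX_MAGICS


def dodge_minix_false_positive(mount_point: str,
                               remount: Callable[[], None]) -> bool:
    """Change free_inodes on disk if it would trip the minix probe.

    `remount` is a hypothetical caller-supplied callback that unmounts and
    mounts the filesystem again so the superblock on disk gets updated.
    Returns True if the workaround was applied.
    """
    free_inodes = os.statvfs(mount_point).f_ffree
    if not collides_with_minix(free_inodes):
        return False
    # Creating a file consumes one inode, which changes free_inodes.
    fd, placeholder = tempfile.mkstemp(dir=mount_point)
    os.close(fd)
    remount()               # write the changed counter to the superblock
    os.unlink(placeholder)  # clean up again
    return True
```

The remount step is passed in as a callback because how the image filesystem is mounted depends entirely on the build tool; the sketch only fixes the order of the steps.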

P.S.: The detours included first noticing that almost every change I made to the image (like wrapping grub2-install into a wrapper script for debugging) made the issue go away because it changed the free_inodes number, so after every change I had to check that the issue was still present. Then I found that copying the locales in grub2-install actually triggers an ENOTDIR ("Not a directory"), because it lacks special handling for the /usr/share/locale/locale.alias file. Of course I thought "this is the issue" and patched it out of grub2, only to find that the original problem still persisted. On top of that, overnight package updates in SLES15-SP1 made the problem go away and reappear seemingly at random... you get the idea 😄