The images were SLES15-SP1 and had not been touched for quite some time; rebuilds were only triggered by new packages in the update channel.
The error was grub2-install failing with the error message "not a directory".
Looking at the recent changes in the update repo showed no obvious reason (some python packages that had nothing to do with grub2-install), so I started to investigate...
... 3 days later, after following some detours, I finally found the issue.
grub2-install scans the installation device for filesystems, and probes all known (to grub2) fs types. The probe of "minix_be" fails fatally. Sometimes.
After building my own grub2 package with lots of debug printfs, I finally found out that the minix fs detection of grub2 is a little "fragile". It does the following (pseudo code):
- grub_mount_minix(device) || return "not minix fs"
- grub_minix_find_file("/") || fatal_error()
The problem is that grub_mount_minix() only does pretty simple magic number checks, which can lead to false positives.
Comparing the superblock structures of ext[234] and minix filesystems (from the grub2 source code) side by side, you see this:
struct grub_minix_sblock           | struct grub_ext2_sblock
{                                  | {
  grub_uint16_t inode_cnt;         |   grub_uint32_t total_inodes;
  grub_uint16_t zone_cnt;          |
  grub_uint16_t inode_bmap_size;   |   grub_uint32_t total_blocks;
  grub_uint16_t zone_bmap_size;    |
  grub_uint16_t first_data_zone;   |   grub_uint32_t reserved_blocks;
  grub_uint16_t log2_zone_size;    |
  grub_uint32_t max_file_size;     |   grub_uint32_t free_blocks;
  grub_uint16_t magic;             |   grub_uint32_t free_inodes;
};
This already hints at the issue: at the same disk location where ext2 stores the free inodes number, minix stores its magic number, which is used by grub to detect if it is a minix file system.
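The overlap can be checked by adding up the field sizes. The following is a minimal sketch that recomputes the byte offsets from the field lists quoted above, assuming packed little-endian structs; the names MINIX_FIELDS, EXT2_FIELDS and field_offset are made up for this illustration and are not part of grub2.

```python
import struct

# Leading superblock fields as quoted above ("H" = uint16, "I" = uint32).
MINIX_FIELDS = [
    ("inode_cnt", "H"), ("zone_cnt", "H"), ("inode_bmap_size", "H"),
    ("zone_bmap_size", "H"), ("first_data_zone", "H"), ("log2_zone_size", "H"),
    ("max_file_size", "I"), ("magic", "H"),
]
EXT2_FIELDS = [
    ("total_inodes", "I"), ("total_blocks", "I"), ("reserved_blocks", "I"),
    ("free_blocks", "I"), ("free_inodes", "I"),
]


def field_offset(fields, name):
    """Return the byte offset of `name` within the packed field list."""
    fmt = "<"  # little-endian, no alignment padding
    for field, code in fields:
        if field == name:
            return struct.calcsize(fmt)
        fmt += code
    raise KeyError(name)


print(field_offset(MINIX_FIELDS, "magic"))        # 16
print(field_offset(EXT2_FIELDS, "free_inodes"))   # 16
```

Both fields start 16 bytes into the superblock, so the minix magic is read from the low half of the ext2 free_inodes counter.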
Now if you happen to have an ext3 file system with a free_inodes number whose lower 16 bits resemble one of the GRUB_MINIX_MAGIC numbers, chances are grub_mount_minix() will succeed, but then the attempt to access the root directory will fail with a fatal error.
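To make the false positive concrete, here is a small sketch that packs the start of an ext-style superblock with a chosen free_inodes value and then applies a magic-only check to the same bytes. It is a toy model, not grub2 code. The magic values are an assumption (the ones I know from the Linux and GRUB sources for the 16-bit minix layout shown above), the other superblock numbers are arbitrary, and the big_endian switch is my guess at how the "minix_be" variant reads the field.

```python
import struct

# Assumption: magic values for the minix layout shown above; verify them
# against the grub2 source before relying on them.
MINIX_MAGICS = {0x137F, 0x138F, 0x2468, 0x2478}


def ext_superblock_prefix(free_inodes):
    """Pack the first five ext2 superblock fields (little-endian on disk).

    All values except free_inodes are arbitrary filler.
    """
    return struct.pack("<5I", 100000, 400000, 20000, 300000, free_inodes)


def looks_like_minix(sblock, big_endian=False):
    """Mimic a magic-only check: read 16 bits at offset 16 and compare."""
    fmt = ">H" if big_endian else "<H"
    (magic,) = struct.unpack_from(fmt, sblock, 16)
    return magic in MINIX_MAGICS


# 70527 == 0x1137F, so the low 16 bits are 0x137F.
print(looks_like_minix(ext_superblock_prefix(70527)))                    # True
print(looks_like_minix(ext_superblock_prefix(70528)))                    # False
# 32531 == 0x7F13, which reads as 0x137F when byte-swapped.
print(looks_like_minix(ext_superblock_prefix(32531), big_endian=True))   # True
```

One more or one less free inode makes the check fail again, which fits with the problem appearing and disappearing after unrelated changes to the image.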
This is a plain grub2 bug, which I will probably report upstream and try to get fixed.
However, I need a fix to have my images build again, and the chances of getting a fix into SLES15-SP1 are ... low (and it is a daunting task, even if you are reporting this bug as a big SLES customer), so I built a workaround in my (locally built, lucky me...) python-kiwi package.
It basically does the following before calling the "chroot [...]":
- statvfs([...]) to get the free_inodes number
- check if the lower 16 bits resemble one of the MINIX_MAGIC numbers
- if it does, touch a temporary file in [...]
- unmount and mount [...] again to update the superblock (I missed this at first and wondered why it did not work)
- unlink the temporary file
- continue as before
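The steps above can be sketched roughly as follows. This is not the actual python-kiwi patch, only an illustration under assumptions: the function names, the mountpoint argument and the remount callback are hypothetical, the magic values are the ones I believe grub2 uses for the 16-bit minix layout, and the byte-swapped comparison is an extra precaution for the "minix_be" probe that the list above does not mention.

```python
import os
import tempfile

# Assumption: minix magic values for the superblock layout discussed above.
MINIX_MAGICS = {0x137F, 0x138F, 0x2468, 0x2478}


def collides_with_minix_magic(free_inodes):
    """True if the low 16 bits match a magic, directly or byte-swapped."""
    low16 = free_inodes & 0xFFFF
    swapped = ((low16 & 0xFF) << 8) | (low16 >> 8)
    return low16 in MINIX_MAGICS or swapped in MINIX_MAGICS


def dodge_minix_false_positive(mountpoint, remount, get_free_inodes=None):
    """Nudge the free inode count if it collides; return True if it did.

    `remount` is a caller-supplied callable that unmounts and mounts the
    filesystem at `mountpoint` again (hypothetical; it needs root and
    depends on how the image is mounted).
    """
    if get_free_inodes is None:
        free_inodes = os.statvfs(mountpoint).f_ffree
    else:
        free_inodes = get_free_inodes(mountpoint)
    if not collides_with_minix_magic(free_inodes):
        return False
    # "touch" a temporary file so the filesystem uses one more inode
    fd, temp_path = tempfile.mkstemp(dir=mountpoint)
    os.close(fd)
    try:
        remount(mountpoint)  # write the changed count to the superblock
    finally:
        os.unlink(temp_path)
    return True
```

The get_free_inodes parameter only exists so the logic can be exercised without a real mount; in a real build the default os.statvfs() call would be used and remount would wrap the actual umount and mount commands.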
This workaround is ugly as hell, but it does work for me.
P.S.: the detours included first noticing that almost every change I made to the image (like wrapping grub2-install into a wrapper script for debugging) made the issue go away because of a different free_inodes number, so after every change I had to check that the issue was still present. Then I found that copying the locales in grub2-install actually triggers an ENOTDIR ("Not a directory"), because it lacks special handling of the /usr/share/locale/locale.alias file. Of course I thought "this is the issue" and patched it out of grub2, just to find that the original problem still persisted... then overnight package updates in SLES15-SP1 made the problem go away and reappear seemingly at random... you get the idea 😄