LVM, md and friends

Linux' Logical Volume Manager (LVM)

LVM in general

LVM is designed to be an abstraction layer on top of physical drives or RAID, typically mdraid or fakeraid. Keep in mind that fakeraid should be avoided unless you really need it, such as when dual-booting Linux and Windows on fakeraid. LVM broadly consists of three elements: the "physical" devices (PV), the volume group (VG) and the logical volume (LV). There can be multiple PVs, VGs and LVs, depending on requirements. More about this below. All commands are given as examples, and all of them can be fine-tuned with extra flags if needed. In my experience, the defaults work well for most use cases.

I mention filesystems below too. Where I write ext4, the same applies to ext2 and ext3.

Create a PV

A PV is the "physical" part. This does not need to be a physical disk; it can also be a RAID, be it mdraid, fakeraid or hardware RAID, a virtual disk on a SAN or elsewhere, or a partition on one of those.

#  These add three PVs, one on a drive (or hardware raid) and another two on mdraids.
pvcreate /dev/sdb
pvcreate /dev/md1 /dev/md2

For more information about pvcreate, see the manual.

Create a VG

The volume group consists of one or more PVs grouped together on which LVs can be placed. If several PVs are grouped in a VG, it's generally a good idea to make sure these PVs have some sort of redundancy, as in mdraid/fakeraid or hwraid. Otherwise it will be like using a RAID-0, where losing any one of the independent drives can take the whole VG with it. LVM has RAID code in it as well, so you can use that. I haven't done so myself, as I generally stick to mdraid. The reason is that mdraid is, in my opinion, older and more stable and has more users (meaning bugs are reported and fixed faster whenever they are found). That said, I believe the actual RAID code used in LVM RAID is the same as for mdraid, so it may not make much of a difference. I still stick with mdraid. To create a VG, run

# Create volume group my_vg
vgcreate my_vg /dev/md1

Note that if vgcreate is given a device (such as /dev/md1 above) that has not yet been initialised as a PV, this is done implicitly. So if you don't need any special flags to pvcreate, you can simply skip it and let vgcreate do that for you.

Create an LV

LVs can be compared to partitions, in a way, since each of them delimits a fraction, or all, of a VG. The difference from a partition, however, is that they can be grown or shrunk easily and also moved around between PVs without downtime. This flexibility makes them superior to partitions, as your system can be changed without users noticing it. By default, an LV is allocated "thickly", meaning all the space given to it is allocated from the VG and thus the PV. The following makes a 100GB LV named "thicklv". When making an LV, I usually allocate what's needed plus some more, but not everything, just to make sure there's space available for growth on any of the LVs on the VG, or for new LVs.

# Create a thick provisioned LV named thicklv on the VG my_vg
lvcreate -n thicklv -L 100G my_vg

After the LV is created, a filesystem can be placed on it unless it is meant to be used directly. Applications for direct use include swap space, VM storage and certain database systems. Most of these will, however, work on filesystems too, although my testing has shown that for swap space, there is a significant performance gain from using dedicated storage without a filesystem. As for filesystems, most Linux users use either ext4 or XFS. Personally, I generally use XFS these days. See my notes below on filesystem choice.

# Create a filesystem on the LV - this could have been mkfs -t ext4 or whatever filesystem you choose
mkfs -t xfs /dev/my_vg/thicklv

Then just edit /etc/fstab with correct data and run mount -a, and you should be all set.
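As a rough sketch of what "correct data" in /etc/fstab might look like, the helper below just prints an entry for an LV. The mount point /srv/data and the function name fstab_entry are made up for illustration, and the choice of fsck pass number (0 for XFS, 2 for anything else) is my assumption of a sensible default, not a rule.

```shell
# Print an fstab entry for an LV.
# Fields: device, mount point, fs type, options, dump flag, fsck pass number.
fstab_entry() {
    dev=$1; mnt=$2; fstype=$3
    # Assumption: XFS is not checked by fsck at boot, so use pass 0;
    # other filesystems on data volumes get pass 2.
    case "$fstype" in
        xfs) pass=0 ;;
        *)   pass=2 ;;
    esac
    printf '%s %s %s defaults 0 %s\n' "$dev" "$mnt" "$fstype" "$pass"
}

# /srv/data is a hypothetical mount point
fstab_entry /dev/my_vg/thicklv /srv/data xfs
```

Review the printed line before appending it to /etc/fstab, create the mount point, and then run mount -a.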

Create LVM cache

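# Create a 1GB LV for the cache metadata on the VG named data. The PV given last (/dev/md1 here)
# is where the LV is allocated, so this should be the fast device.
lvcreate -L 1G -n _cache_meta data /dev/md1

# Use the remaining free space, allocated from /dev/md1, for the cache data itself
lvcreate -l100%FREE -n _cache data /dev/md1

# Combine the two into a cache pool. Some would want --cachemode writeback, but seriously,
# I wouldn't recommend it. Metadata or data can be easily corrupted in case of failure.
lvconvert --type cache-pool --cachemode writethrough --poolmetadata data/_cache_meta data/_cache

# Attach the cache pool to the LV that is to be cached, data/data
lvconvert --type cache --cachepool data/_cache data/data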

Disabling LVM cache

If you for some reason want to disable caching, this will do it cleanly and remove the LVs earlier created for caching the main device.

lvconvert --uncache data/data

Growing devices

LVM objects can be grown and shrunk. If a PV resides on a RAID where a new drive has been added or otherwise grown, or on a partition or virtual disk that has been extended, the PV must be updated to reflect these changes. The following command will grow the PV to the maximum available on the underlying storage.

# Resize the PV /dev/md1 (not the RAID, only the PV on the RAID) to its full size
pvresize /dev/md1

If a new PV is added, the VG can be grown to add the space on that in addition to what's there already.

# Extend my_vg to include md2 and its space
vgextend my_vg /dev/md2

With more space available in the VG, the LV can now be extended. Let's add another 50GB to it.

# Extend thicklv - add 50GB. If you know you won't need the space anywhere else, you may want to
# lvresize -l+100%FREE instead. Keep in mind the difference between -l (extents) and -L (bytes)
lvresize -L +50G my_vg/thicklv
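To make the difference between -l and -L a bit more concrete, here is a small sketch that converts a size in GiB to a number of extents. The function name gib_to_extents is made up for this example. It assumes the LVM default extent size of 4 MiB unless told otherwise; check the real value for your VG with vgdisplay.

```shell
# Convert a size in GiB to a number of extents, given the extent size in MiB.
# Assumption: the extent size defaults to 4 MiB, as it does for a VG
# created without --physicalextentsize.
gib_to_extents() {
    size_gib=$1
    extent_mib=${2:-4}
    echo $(( size_gib * 1024 / extent_mib ))
}

gib_to_extents 50       # with 4 MiB extents, -L +50G equals -l +12800
gib_to_extents 50 32    # the same size on a VG with 32 MiB extents
```

In other words, on a VG with default extents, lvresize -L +50G and lvresize -l +12800 ask for the same amount of space.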

After the LV has grown, run xfs_growfs (xfs) or resize2fs (ext4) to make use of the new space.
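The two tools take different arguments, which is easy to forget: xfs_growfs works on a mounted filesystem and is normally given the mount point, while resize2fs is given the block device. The sketch below only prints the command to run. The function name grow_fs_cmd and the mount point /srv/data are made up for illustration.

```shell
# Print (not run) the command that grows the filesystem after lvresize.
grow_fs_cmd() {
    fstype=$1; dev=$2; mnt=$3
    case "$fstype" in
        xfs)            echo "xfs_growfs $mnt" ;;   # XFS grows online, via the mount point
        ext2|ext3|ext4) echo "resize2fs $dev" ;;    # ext* takes the block device
        *)              echo "unsupported filesystem: $fstype" >&2; return 1 ;;
    esac
}

# /srv/data is a hypothetical mount point
grow_fs_cmd xfs  /dev/my_vg/thicklv /srv/data
grow_fs_cmd ext4 /dev/my_vg/thicklv /srv/data
```

Alternatively, lvresize has a -r (--resizefs) flag that grows the filesystem in the same step, as in lvresize -r -L +50G my_vg/thicklv.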

Migration tools

At times, storage regimes change, new storage is added and sometimes it's not easy to migrate with the current hardware or its support systems. Once I had to migrate a 45TiB fileserver from one storage system to another, preferably without downtime. The server originally had three 15TiB PVs of which two were full and the third half full. I resorted to using pvmove to just move the data from the PVs in use to a new PV. We started out by creating a new 50TiB PV, sde, and then ran pvmove.

# Attach /dev/sde to the VG my_vg
vgextend my_vg /dev/sde

# Move the contents from /dev/sdb over to /dev/sde block by block. If a target is not given in pvmove,
# the contents of the source (sdb here) will be put somewhere else in the pool.
pvmove /dev/sdb /dev/sde

This took a while (as in a week or so) - the pvmove command uses old code and logic (or at least did that when I migrated this in November 2016), but it affected performance on the server very little, so users didn't notice. After the first PV was migrated, I continued with the other two, one after the other, and after a month or so, it was migrated. We used this on a number of large fileservers, and it worked flawlessly. Before doing the production servers I also tried interrupting pvmove in various ways, including hard resets, and it just kept on after the reboot.

NOTE: Even though it worked well for us, always keep a good backup before doing this. Things may go wrong, and without a good backup, you may be looking for a hard-to-find new job the next morning.

mdadm workarounds

Migrate from a mirror (raid-1) to raid-10

Create a mirror

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

Put LVM (PV/VG/LV) on md0 (see above), put a filesystem on the LV and fill it with some data.

Now, some months later, you want to change this to raid-10 to allow for the same amount of redundancy, but to allow more disks into the system.

mdadm --grow /dev/md0 --level=10
mdadm: Impossibly level change request for RAID1

So, no, that doesn't work.

Since we're on LVM already, this'll be easy. If you're using filesystems directly on partitions, you can do this the same way, but without the pvmove part, using rsync or whatever you like instead. I'd recommend using LVM for the new raid, which should be rather obvious from this article. Now plug in a new drive and create a new raid10 on that one. If you have two new drives, install both.

# change "missing" to the other device name if you installed two new drives
mdadm --create /dev/md1 --level=10 --raid-devices=2 /dev/vdd missing

Now, as described above, just vgextend the VG, adding the new raid, and run pvmove to move the data from the old PV (residing on the old raid1) to the new PV (on raid10). After pvmove is finished (which may take a while, see above), just

vgreduce raidtest /dev/md0
pvremove /dev/md0
mdadm --stop /dev/md0

…and your disk is free to be added to the new array. If the new raid is running in degraded mode (if you created it with just one drive), don't wait too long, since you don't have redundancy. Just mdadm --add the devices.

If /proc/mdstat now tells you your raid10 is active and has three drives, of which one is a spare, it should look something like this

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid10 vdc[3](S) vdb[2] vdd[0]
      8380416 blocks super 1.2 2 near-copies [2/2] [UU]

Just mdadm --grow --raid-devices=3 /dev/md1

The RAID section is not complete. I'll write more on migrating from raid10 to raid6 later. It's not straightforward, but easier than this one :)

Thin provisioned LVs

Thin provisioning on LVM is a method used for not allocating all the space given to an LV. For instance, if you have a 1TB VG and want to give an LV 500GB, although it currently only uses a fraction of that, a thin LV can be a good alternative. This sets a limit on how much it can use, but also lets it grow dynamically without manual work by the sysadmin. Thin provisioning adds another layer of abstraction by creating a special LV as a thin pool from which space is allocated to the thin volume(s).

Create a thin pool

Allocate 1TB to the thin pool to be used for thinly provisioned LVs. Keep in mind that the space allocated to the thin pool is in fact thick provisioned. Only the volumes put on the thin pool are thin provisioned. This will create two hidden LVs, one for data and one for metadata. Normally the defaults will do, but check the manual if the data doesn't match the usual patterns (such as billions of files, resulting in huge amounts of metadata). I believe the metadata part should be resizable at a later time if needed, but I have not tested it.

# Create a pool for thin LVs. Keep in mind that the pool itself is not thin provisioned, only the volumes residing on it
lvcreate --size 1T --type thin-pool --thinpool thinpool my_vg

Create a thin volume

You have the thinpool, now put a thin volume on it. It will allocate some space for metadata, but probably not much (a few megs, perhaps).

# Now, create a volume named thinvol with a virtual size (-V) of half a terabyte, using the thin pool (-T) named thinpool for storage
lvcreate -V 500G -T my_vg/thinpool --name thinvol

The new volume's device name will be /dev/my_vg/thinvol. Now, create a filesystem on it, add it to fstab and mount it. The df command will return available space according to the virtual size (-V), while lvs will show how much data is actually used on each of the thinly provisioned volumes.
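Since the virtual sizes are not backed by real space until written, it's possible to promise more than the pool holds. The sketch below, with the made-up name thin_commit_pct, just sums the virtual sizes and reports them as a percentage of the pool size. All sizes are in GiB and the numbers in the example are for illustration only.

```shell
# Sum the virtual sizes (GiB) of the thin volumes and compare them with the
# pool size (GiB). Prints the commitment as a whole percentage of the pool;
# anything above 100 means the pool is overcommitted.
thin_commit_pct() {
    pool_gib=$1; shift
    total=0
    for v in "$@"; do
        total=$(( total + v ))
    done
    echo $(( total * 100 / pool_gib ))
}

thin_commit_pct 1024 500            # one 500G volume on a 1T pool
thin_commit_pct 1024 500 500 500    # three of them: overcommitted
```

An overcommitted pool works fine until the volumes actually fill up, so it's worth watching real usage, for instance with lvs -o lv_name,data_percent my_vg.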

Filesystem dilemmas

XFS or ext4 or something else

The choice of filesystem depends on your use. Distros such as RHEL/CentOS have moved to XFS by default from v7. Debian/Ubuntu still stick to ext4, like most others. I'll discuss my thoughts below on ext4 vs XFS and leave the other filesystems for later.

ext4

ext4 is probably the most used filesystem on Linux as of writing. It's rock stable and has gained a lot of extensions since the original ext2. It still retains backward compatibility, being able to mount and fsck ext2 and ext3 with the ext4 driver. However, it doesn't handle large filesystems very well. If a filesystem is created for <16TiB, it cannot be grown to anything larger than 16TiB. This may have changed lately, though. One other issue is error handling. If a large filesystem (say 10TiB) is damaged, an fsck may take several hours, leading to long downtime.

XFS

XFS was originally developed by SGI some time in the bronze age, and has been worked on heavily since. RHEL and CentOS version 7 and forward use XFS as the default filesystem. It is quick and it scales well, but it lacks at least two things ext4 has. It cannot be shrunk (not that you normally need that, but it's nice if you made a typo or need to change things). Also, it doesn't allow for automatic fsck on bootup. If something is messed up on the filesystem, you'll have to log in as root on the console (not ssh), which might be an issue on some systems. The fsck equivalent, xfs_repair, is a *lot* faster than ext4's fsck, though.

ZFS

ZFS is like a combination of RAID, LVM and a filesystem, supporting full checksumming, auto-healing (given sufficient redundancy), compression, encryption, replication, deduplication (if you're brave and have a ton of memory) and a lot more. It's a separate chapter and needs a lot more talk than a paragraph here.

Low level disk handling

This section is for low-level handling of disks, regardless of storage system elsewhere. These apply to mdraid, lvmraid and zfs and possibly other solutions, with the exception of individual drives on hwraid (and maybe fakeraid), since there, the drives are hidden from Linux (or whatever OS).

"Unplug" drive from system

# Find the device unit's number with something like
find /sys/bus/scsi/devices/*/block/ -name sdd
# returning /sys/bus/scsi/devices/3:0:0:0/block/sdd here, 
# meaning 3:0:0:0 is the SCSI device we're looking for.

# Given this is the correct device, and you want to 'unplug' it, do so by
echo 1 > /sys/bus/scsi/devices/3:0:0:0/delete

Rescan disk controllers

If a drive is added and for some reason doesn't get detected, or is removed by the command above, it can be rediscovered with a sysfs command to the scsi host to which it is connected. Since there may be quite a few of these, an easy way is to just scan them all and check the output of dmesg -T after running it.

# Make a list of controllers and send a rescan message to each of them. It won't do anything 
# for those where nothing has changed, but it will show new drives where they have been added.
for host_scan in /sys/class/scsi_host/host*/scan
do
  echo '- - -' > "$host_scan"
done

Fail injection with debugfs

https://lxadm.com/Using_fault_injection

mdadm stuff

Spare pools

In the old times, mdadm supported only a dedicated spare per md device, so if working with a large set of drives (20? 40?), where you'd want to set up more raid sets to increase redundancy, you'd dedicate a spare to each of the raid sets. This has changed (some time ago?), so in these modern, heathen days, you can change mdadm.conf, usually placed under /etc/mdadm, and add 'spare-group=somegroup' where 'somegroup' is a string identifying the spare group. After doing this, run update-initramfs -u and reboot, then add a spare to one of the raid sets in the spare group, and md will use that or those spares for all the raid sets in that group.

As some pointed out on #linux-raid @ irc.freenode.net, this feature is very badly documented, but as far as I can see, it works well.

Example config

# Create two raidsets

mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/vd[efghij]
mdadm --create /dev/md2 --level=6 --raid-devices=6 /dev/vd[klmnop]

# get their UUID etc

mdadm --detail --scan
ARRAY /dev/md/1 metadata=1.2 name=raidtest:1 UUID=1a8cbdcb:f4092350:348b6b80:c054d74c
ARRAY /dev/md/2 metadata=1.2 name=raidtest:2 UUID=894b1b7c:cb7eba70:917d6033:ea5afd2b

# Put those lines into /etc/mdadm/mdadm.conf and add the spare-group

ARRAY /dev/md/1 metadata=1.2 name=raidtest:1 spare-group=raidtest UUID=1a8cbdcb:f4092350:348b6b80:c054d74c
ARRAY /dev/md/2 metadata=1.2 name=raidtest:2 spare-group=raidtest UUID=894b1b7c:cb7eba70:917d6033:ea5afd2b

# update the initramfs and reboot

update-initramfs -u
reboot

# add a spare drive to one of the raids

mdadm --add /dev/md1 /dev/vdq

# fail a drive on the other raid

mdadm --fail /dev/md2 /dev/vdn

# check /proc/mdstat to see md2 rebuilding with the spare from md1

Tuning

While trying to compare a 12-disk RAID (8TB disks) to a hwraid, mdraid was slower, so we did a few things to speed things up:

While testing with fio, monitor /sys/block/md0/md/stripe_cache_active, and see if it's close to /sys/block/md0/md/stripe_cache_size. If it is, double the size of the latter.

echo 4096 > /sys/block/md0/md/stripe_cache_size

Continue until stripe_cache_active stabilises, and then you've found the limit you'll need. You may want to double that for good measure.
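The doubling procedure can be sketched as a small helper that decides the next value from the two numbers read from sysfs. The function name next_stripe_cache_size is made up, and treating "close" as 90% or more of the current size is my assumption, so adjust the threshold to taste.

```shell
# Decide the next stripe_cache_size from the observed stripe_cache_active
# and the current stripe_cache_size.
# Assumption: "close" means active is at least 90% of size.
next_stripe_cache_size() {
    active=$1; size=$2
    if [ $(( active * 100 )) -ge $(( size * 90 )) ]; then
        echo $(( size * 2 ))    # cache is saturated, double it
    else
        echo "$size"            # enough headroom, leave it alone
    fi
}

next_stripe_cache_size 250 256     # saturated, suggests doubling
next_stripe_cache_size 100 4096    # plenty of headroom, unchanged
```

While fio is running, feed it the contents of stripe_cache_active and stripe_cache_size for the array, and write the result back to stripe_cache_size as root if it changed.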

(to be continued)

Links

Below are a few links with useful info

Irreversible mdadm failure recovery

ZFS

ZFS can do a lot of things most filesystems or volume managers can't. It checksums all data and metadata, and as long as the system uses ECC memory (RAM), it will have complete control over the whole data set, and when (not if) an error occurs, it'll fix the issue without even throwing a warning. To fix the issue, you'll obviously need sufficient redundancy (mirrors or RAIDzN).

ZFS send/receive between machines

ZFS has a send/receive mechanism that allows for snapshots to be sent over the wire to a receiving end. This can be a full filesystem, or an incremental change since the last send/receive. In a WAN scenario, you'll probably want to use a VPN for this. On a controlled network, sending in cleartext is also possible. You may as well use ssh, but keep in mind that ssh was never designed to be used as a full-blown VPN solution and is just too slow for the task with large amounts of data. You may, though, use mbuffer, if on a local LAN.

# Allow traffic from the sending IP in whatever firewall interface you're using (if you're using a firewall at all, that is)
ufw allow from x.x.x.x

# Start the receiver first. This listens on port 9090, has a 1GB buffer,
# and uses 128kb chunks (same as zfs):

mbuffer -s 128k -m 1G -I 9090 | zfs receive data/filesystem
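The sending side is not shown above. As a sketch of what it would probably look like, the helper below prints a matching pipeline using mbuffer's -O host:port output option with the same block and buffer sizes as the receiver. The function name zfs_send_pipeline, the snapshot name data/filesystem@snap1 and the host receiver.example.com are all made up for illustration.

```shell
# Print (not run) the sender-side pipeline for a snapshot, receiver host and port.
# The -s and -m values mirror the receiver command above.
zfs_send_pipeline() {
    snapshot=$1; host=$2; port=$3
    echo "zfs send $snapshot | mbuffer -s 128k -m 1G -O $host:$port"
}

# Hypothetical snapshot and host names
zfs_send_pipeline data/filesystem@snap1 receiver.example.com 9090
```

Take the snapshot first with zfs snapshot, make sure the receiver is already listening, and then run the printed pipeline on the sending machine.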

To be continued one day… See this for some more. I'll get back with details on it.

General Linux system control

Add a new CPU (core)

If working with virtual machines, adding a new core can be useful if the VM is slow and the load is multithreaded or multiprocess. To do this, add a new CPU in the hypervisor (be it KVM or VMware or Hyper-V or whatever). Some hypervisors, like VMware, will activate it automatically if open-vm-tools is installed. Please note that open-vm-tools is for VMware only. I don't know of any such thing for KVM or other hypervisors (although I believe VirtualBox has its own set of tools, but not as a package). If this doesn't work, run

echo 1 > /sys/devices/system/cpu/cpuX/online

Where X is the CPU number you want to add. You can 'cat /sys/devices/system/cpu/cpuX/online' to check if it's online.
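If several cores were added at once, looping over them saves some typing. This is a minimal sketch with the made-up name online_all_cpus. The sysfs directory is a parameter, which is my addition so the function can be tried against a scratch directory before touching the real thing.

```shell
# Bring every offline CPU online.
# The sysfs root defaults to the real location; pass another directory to
# try the function out without root privileges.
online_all_cpus() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue          # CPUs without an online file are skipped
        if [ "$(cat "$f")" = "0" ]; then
            echo 1 > "$f"
            echo "onlined ${f%/online}"
        fi
    done
}
```

Run online_all_cpus as root with no arguments to act on the real /sys/devices/system/cpu. CPUs that have no online file (often cpu0) are left alone.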

Mount partitions on a disk image

Sometimes you'll have a disk image from recovering a bad drive after hours (or days or weeks) of ddrescue, and sometimes you just have a disk image from a virtual machine where you want to mount the partitions inside the image. There are three main methods of doing this, which are more or less equivalent, but some are easier than others.

Basics

A disk image is usually like a disk, consisting of either one large filesystem, a PV, or a partition table with one or more partitions, each holding a filesystem, a PV, swap or something else. If it's partitioned, and you want your operating system to mount a filesystem or allow it to see whatever LVM stuff lies on it, you need to isolate the partitions.

Manual mapping

In the olden days, this was done manually, something like this

root@mymachine:~# fdisk -l /dev/vda
Disk /dev/vda: 25 GiB, 26843545600 bytes, 52428800 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xc37b7ea5

Device     Boot    Start      End  Sectors  Size Id Type
/dev/vda1  *        2048 50333311 50331264   24G 83 Linux
/dev/vda2       50333312 52426369  2093058 1022M  5 Extended
/dev/vda5       50333314 52426369  2093056 1022M 82 Linux swap / Solaris
root@mymachine:~#

To calculate where each partition starts, just take the start sector and multiply it by the sector size (512 bytes here) and you have the offset in bytes. After this, it's merely a losetup -o <offset> /dev/loopX <nameofimagefile> and you have the partition mapped to your /dev/loopX (X typically being somewhere in 0-255). After this, just mount it or run pvscan/vgscan/lvscan to probe whatever LVM config is there, and you should be able to mount it like any other filesystem.
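The offset arithmetic is simple enough for the shell itself. The sketch below uses the start sectors from the fdisk listing above. The function name partition_offset is made up for this example, and the sector size defaults to 512 bytes as in that listing.

```shell
# Byte offset of a partition inside a disk image:
# start sector multiplied by the sector size (default 512 bytes).
partition_offset() {
    start_sector=$1
    sector_size=${2:-512}
    echo $(( start_sector * sector_size ))
}

partition_offset 2048        # vda1 starts at sector 2048 -> 1048576
partition_offset 50333314    # vda5, the swap partition
```

The result can be passed straight to losetup with command substitution, as in losetup -o "$(partition_offset 2048)" /dev/loopX <nameofimagefile>.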

kpartx

kpartx does the same as above, more or less, just without so much hassle. Most of it should be automatic and easy to deal with, except it may sometimes fail to disconnect the loopback devices, so you may have to losetup -d them manually. See the manual, kpartx (8). Please note that partx works similarly and I'd guess the authors of the two tools are arguing a lot about which one's the best.

guestmount

guestmount is something similar again, but also supports image formats like qcow2 and possibly other, proprietary formats like vmdk and vdi, but don't take my word for it - I haven't tested. Again, see the manual or just read up on the description here. It seems rather trivial.

Linux on laptops

Wifi with HP EliteBook 725/745 on Ubuntu

The HP EliteBook 725/745 and similar are equipped with a BCM4352 wifi chip. As with a lot of other stuff from Broadcom, this lacks an open hardware description, so no open driver exists. The proper fix is to replace the NIC with something with an open driver, but this isn't always possible, so a binary driver exists. In Ubuntu, somehow connect the machine to the internet and run

sudo apt-get update
sudo apt-get install bcmwl-kernel-source
sudo modprobe wl

If you can't connect the machine to the internet, download the needed packages on another machine and put them on some USB storage. These packages should suffice:

# apt-cache depends bcmwl-kernel-source
bcmwl-kernel-source
  Depends: dkms
  Depends: linux-libc-dev
  Depends: libc6-dev
  Conflicts: <bcmwl-modaliases>
  Replaces: <bcmwl-modaliases>

Then install manually with dpkg -i file1.deb file2.deb etc

After this, ip/ifconfig/iwconfig should see the new wifi NIC, probably wlan0, but due to an old, ignored bug, you won't find any wifi networks. This is because the driver from Broadcom apparently does not support interrupt remapping, commonly used on x86-64 machines. To turn this off, edit /etc/default/grub and change the line GRUB_CMDLINE_LINUX_DEFAULT="quiet splash", adding intremap=off to the end before the quote: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intremap=off". Save the file and run sudo update-grub and reboot. After this, wifi should work well.
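If you'd rather script the edit, something like the sketch below should do. The function name add_kernel_param is made up, it relies on GNU sed's -i flag, and it assumes the parameter contains no slashes. Try it on a copy of the file first.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a grub
# defaults file, unless it is already there.
# Assumptions: GNU sed (for -i), and no "/" in the parameter.
add_kernel_param() {
    file=$1; param=$2
    # Skip the edit if the parameter is already on the line
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
}
```

Usage would be sudo cp /etc/default/grub /etc/default/grub.bak followed by running add_kernel_param /etc/default/grub intremap=off as root, and then sudo update-grub as described above.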

3D printer stuff

Hydrophilic filament

Most popular filament types, including PLA, PETG, PP or PA (Polyamide, normally known as Nylon) and virtually any filament type named poly-something (and thus with an acronym of P-something) are hydrophilic (also, somewhat incorrectly, called hydroscopic), meaning that as long as they're drier than the atmosphere around them, they'll do their best to suck out the atmosphere's humidity. This is why most filament is shrink-wrapped with a small bag of silica gel in it. If such filament is exposed to normal humid air for some time, it'll become very hard to use. Most of my experience is with PLA, which can handle a bit (depending on make), but still not a lot. With PLA, if you end up with prints where the filament seems not to stick, a higher temperature may help. Otherwise, the filament must be baked. If you don't want to use your oven, try something like a fruit and mushroom dryer. Then get a good, sealable plastic box, add something like half a kilogram of silica gel to an old sock, tie it and put it in your new dry box.

Please note that ABS is hydrophobic, so you won't have to worry too much there. Also, remember that PA/Nylon is extremely hydrophilic. Even a couple of days in normal air humidity can ruin it completely and will require baking.

Upgrading the firmware on an Ender 3

This is about upgrading the firmware, and thus also flashing a bootloader to the Creality Ender 3 3D printer. Most of this is well documented in several places, like in the video from Teaching Tech and elsewhere, but I'll just summarise some issues I had.

Using an arduino Nano as a programmer

The CR-10, Ender 3 and a few other printers from Creality come without a bootloader, so you'll need to flash one before upgrading the firmware. Well, strictly speaking, you don't need one, you can save those 8kB or so for other stuff and use a programmer to write the whole firmware, but then you'll need to wire up everything every time you want a new firmware, so in my world, you do need the bootloader, since it's easier. Note that some newer printers, like the CR-10S and possibly all newer ones, come with the bootloader already and thus won't need another.

To install a bootloader, you'll need a programmer, and I didn't have one available. Having some el-cheapo Chinese copies of Arduinos around, I read I could use one and tried that. Although the docs mostly mention the Uno etc, it's just as simple with the Nano copy.

So, grab an Arduino, plug it into your machine with a USB cable, and, using the Arduino IDE, use the ArduinoISP sketch there (found under Examples) to burn the software needed for the Arduino to work as a programmer. When this is done, connect a 10µF capacitor between RST and GND to stop it from resetting when used as a programmer. Connect the MOSI, MISO and SCK pins from the ICSP block (that little isolated 6-pin block on top of the ardu). See here for a pinout. Then find Vcc and GND either on the ICSP block or elsewhere. Some sites say that on some boards you shouldn't use the pins on the ICSP block for this. I doubt it matters, though. Then connect another jumper wire from pin 10 on the ardu to work as the reset pin outwards to the chip being programmed (that is, on the printer). The ICSP jumper block pin layout is the same on the printer and on the ardu, and probably elsewhere. Sometimes standardisation actually works…

See https://cloud.karlsbakk.net/index.php/s/zw48YJjzaf8Rc2F for a pic with the ardu (this wiki is broken atm, will fix it later)

Flashing the printer

When the bootloader is in place, possibly after some weird errors during the process, the screen on the 3D printer will go blank and it'll behave as if it were bricked. This is ok. Now disconnect the wires and connect the USB cable between computer and printer, download the TH3D unified firmware and configure it for your particular needs. The video mentioned above describes this well. After flashing the new firmware, you'll probably get some timeouts and errors, since apparently it resets the printer after giving it the new firmware, but without notifying the flashing software of it doing so. If this happens, hold your breath and reboot the printer and check its tiny monitor - it should work. If it doesn't work, try to flash it again, and if that doesn't work, try flashing the programmer again yet another time, just for kicks.

PS: I tried enabling MANUAL_MESH_LEVELING, for which the code says "If used with a 1284P board the bootscreen will be disabled to save space". This effectively disabled the whole thing, and may apparently make the firmware image a wee bit too large to fit the 1284P on the Ender. It works well without it, though.

And then…

With the new firmware in place, the printer should work as before, although with thermal runaway protection to better avoid fires in case the shit hits the fan, and to allow for things like autolevelling. With any luck, this thing may be ready for use by the time you read this - it surely looks nice!

roy

Roy's notes