12.1.1.2. Setting up RAID
Setting up RAID volumes requires the mdadm package; it provides the mdadm command, which allows creating and manipulating RAID arrays, as well as scripts and tools integrating it into the rest of the system, including the monitoring system.
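If the package isn't installed yet, it can be pulled in with a single command (a minimal sketch, output omitted):
# apt-get install mdadm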
Our example will be a server with a number of disks, some of which are already used, the rest being available to set up RAID. We initially have the following disks and partitions:
the sda disk, 4 GB, is entirely available;
the sde disk, 4 GB, is also entirely available;
on the sdg disk, only partition sdg2 (about 4 GB) is available;
finally, the sdh disk, also 4 GB, is entirely available.
We're going to use these physical elements to build two volumes, one RAID-0 and one mirror (RAID-1). Let's start with the RAID-0 volume:
# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sde
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
# mdadm --query /dev/md0
/dev/md0: 8.00GiB raid0 2 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Thu Sep 30 15:21:15 2010
Raid Level : raid0
Array Size : 8388480 (8.00 GiB 8.59 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Sep 30 15:21:15 2010
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Chunk Size : 512K
Name : squeeze:0 (local to host squeeze)
UUID : 0012a273:cbdb8b83:0ee15f7f:aec5e3c3
Events : 0
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       64        1      active sync   /dev/sde
# mkfs.ext4 /dev/md0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
524288 inodes, 2097152 blocks
104857 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2147483648
64 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# mkdir /srv/raid-0
# mount /dev/md0 /srv/raid-0
# df -h /srv/raid-0
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              8.0G  249M  7.4G   4% /srv/raid-0
The mdadm --create command requires several parameters: the name of the volume to create (/dev/md*, with MD standing for Multiple Device), the RAID level, the number of disks (which is compulsory despite being mostly meaningful only with RAID-1 and above), and the physical drives to use. Once the device is created, we can use it like we'd use a normal partition, create a filesystem on it, mount that filesystem, and so on. Note that our creation of a RAID-0 volume on md0 is nothing but coincidence, and the numbering of the array doesn't need to be correlated to the chosen amount of redundancy.
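To have the new filesystem mounted automatically at boot, an entry along the following lines could be added to /etc/fstab (a sketch; the mount options and the choice of the /dev/md0 name rather than a filesystem UUID are left to the administrator):
/dev/md0   /srv/raid-0   ext4   defaults   0   2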
Creating a RAID-1 volume proceeds in a similar fashion, the differences only becoming noticeable after creation:
# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdg2 /dev/sdh
mdadm: largest drive (/dev/sdg2) exceeds size (4194240K) by more than 1%
Continue creating array? y
mdadm: array /dev/md1 started.
# mdadm --query /dev/md1
/dev/md1: 4.00GiB raid1 2 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Thu Sep 30 15:39:13 2010
Raid Level : raid1
Array Size : 4194240 (4.00 GiB 4.29 GB)
Used Dev Size : 4194240 (4.00 GiB 4.29 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Sep 30 15:39:26 2010
State : active, resyncing
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Rebuild Status : 10% complete
Name : squeeze:1 (local to host squeeze)
UUID : 20a8419b:41612750:b9171cfe:00d9a432
Events : 27
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       8      112        1      active sync   /dev/sdh
# mdadm --detail /dev/md1
/dev/md1:
[...]
State : active
[...]
A few remarks are in order. First, mdadm notices that the physical elements have different sizes; since this implies that some space will be lost on the bigger element, a confirmation is required.
More importantly, note the state of the mirror. The normal state of a RAID mirror is that both disks have exactly the same contents. However, nothing guarantees this is the case when the volume is first created. The RAID subsystem will therefore provide that guarantee itself, and there will be a synchronisation phase as soon as the RAID device is created. After some time (the exact amount will depend on the actual size of the disks…), the RAID array switches to the “active” state. Note that during this reconstruction phase, the mirror is in a degraded mode, and redundancy isn't assured. A disk failing during that risk window could lead to losing all the data. Large amounts of critical data, however, are rarely stored on a freshly created RAID array before its initial synchronisation. Note that even in degraded mode, the /dev/md1 device is usable, and a filesystem can be created on it, as well as some data copied on it.
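The progress of this initial synchronisation can also be followed directly in the kernel's status file, independently of mdadm (a quick sketch; watch simply refreshes the display every two seconds):
# cat /proc/mdstat
# watch cat /proc/mdstat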
Now let's see what happens when one of the elements of the RAID-1 array fails. mdadm, in particular its --fail option, allows simulating such a disk failure:
# mdadm /dev/md1 --fail /dev/sdh
mdadm: set /dev/sdh faulty in /dev/md1
# mdadm --detail /dev/md1
/dev/md1:
[...]
Update Time : Thu Sep 30 15:45:50 2010
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Name : squeeze:1 (local to host squeeze)
UUID : 20a8419b:41612750:b9171cfe:00d9a432
Events : 35
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       0        0        1      removed
       2       8      112        -      faulty spare   /dev/sdh
The contents of the volume are still accessible (and, if it's mounted, the applications don't notice a thing), but the data safety isn't assured anymore: should the sdg disk fail in turn, the data would be lost. We want to avoid that risk, so we'll replace the failed disk with a new one, sdi:
# mdadm /dev/md1 --add /dev/sdi
mdadm: added /dev/sdi
# mdadm --detail /dev/md1
/dev/md1:
[...]
Raid Devices : 2
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Thu Sep 30 15:52:29 2010
State : active, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 1
Spare Devices : 1
Rebuild Status : 45% complete
Name : squeeze:1 (local to host squeeze)
UUID : 20a8419b:41612750:b9171cfe:00d9a432
Events : 53
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       3       8      128        1      spare rebuilding   /dev/sdi
       2       8      112        -      faulty spare   /dev/sdh
# [...]
[...]
# mdadm --detail /dev/md1
/dev/md1:
[...]
Update Time : Thu Sep 30 15:52:35 2010
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Name : squeeze:1 (local to host squeeze)
UUID : 20a8419b:41612750:b9171cfe:00d9a432
Events : 71
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       8      128        1      active sync   /dev/sdi
       2       8      112        -      faulty spare   /dev/sdh
Here again, the kernel automatically triggers a reconstruction phase during which the volume, although still accessible, is in a degraded mode. Once the reconstruction is over, the RAID array is back to a normal state. One can then tell the system that the sdh disk is about to be removed from the array, so as to end up with a classical RAID mirror on two disks:
# mdadm /dev/md1 --remove /dev/sdh
mdadm: hot removed /dev/sdh from /dev/md1
# mdadm --detail /dev/md1
/dev/md1:
[...]
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       8      128        1      active sync   /dev/sdi
From then on, the drive can be physically removed when the server is next switched off, or even hot-removed when the hardware configuration allows hot-swap. Such configurations include some SCSI controllers, most SATA disks, and external drives operating on USB or Firewire.
12.1.1.3. Backing up the Configuration
Most of the meta-data concerning RAID volumes are saved directly on the disks that make up these arrays, so that the kernel can detect the arrays and their components and assemble them automatically when the system starts up. However, backing up this configuration is encouraged, because this detection isn't fail-proof, and it is only to be expected that it will fail precisely in sensitive circumstances. In our example, if the sdh disk failure had been real (instead of simulated) and the system had been restarted without removing this sdh disk, this disk could start working again due to having been probed during the reboot. The kernel would then have three physical elements, each claiming to contain half of the same RAID volume. Another source of confusion can come when RAID volumes from two servers are consolidated onto one server only. If these arrays were running normally before the disks were moved, the kernel would be able to detect and reassemble the pairs properly; but if the moved disks had been aggregated into an md1 on the old server, and the new server already has an md1, one of the mirrors would be renamed.
Backing up the configuration is therefore important, if only for reference. The standard way to do it is by editing the /etc/mdadm/mdadm.conf file, an example of which is listed here:
Example 12.1. mdadm configuration file
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#
# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sd*
# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
ARRAY /dev/md0 metadata=1.2 name=squeeze:0 UUID=6194b63f:69a40eb5:a79b7ad3:c91f20ee
ARRAY /dev/md1 metadata=1.2 name=squeeze:1 UUID=20a8419b:41612750:b9171cfe:00d9a432
One of the most useful details is the DEVICE option, which lists the devices where the system will automatically look for components of RAID volumes at start-up time. In our example, we replaced the default value, partitions, with an explicit list of device files, since we chose to use entire disks and not only partitions for some volumes.
The last two lines in our example are those allowing the kernel to safely pick which volume number to assign to which array. The metadata stored on the disks themselves are enough to re-assemble the volumes, but not to determine the volume number (and the matching /dev/md* device name).
Fortunately, these lines can be generated automatically:
# mdadm --misc --detail --brief /dev/md?
ARRAY /dev/md0 metadata=1.2 name=squeeze:0 UUID=6194b63f:69a40eb5:a79b7ad3:c91f20ee
ARRAY /dev/md1 metadata=1.2 name=squeeze:1 UUID=20a8419b:41612750:b9171cfe:00d9a432
The contents of these last two lines don't depend on the list of disks included in the volume. It is therefore not necessary to regenerate these lines when replacing a failed disk with a new one. On the other hand, care must be taken to update the file when creating or deleting a RAID array.
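In practice, this update can be largely automated: the ARRAY lines can be regenerated and appended to the file, and the initramfs refreshed so that the arrays are assembled consistently at boot time (a sketch, assuming any stale ARRAY lines have first been removed from /etc/mdadm/mdadm.conf):
# mdadm --detail --scan >> /etc/mdadm/mdadm.conf
# update-initramfs -u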