Redundant iSCSI storage for Linux
Here's how to set up relatively cheap redundant iSCSI storage on Linux. The redundancy is achieved with LVM mirroring: two storage servers built from commodity hardware and running the OpenFiler Linux distribution expose their disks to clients using iSCSI over Ethernet. The servers are completely separate entities, and the purpose of the mirroring is to keep the logical volumes available even while one of the storage servers is down for maintenance or due to hardware failure.
Ultimately the disks of the iSCSI target servers will show up as normal SCSI disks on the client (/dev/sdb, /dev/sdc, ...). The data is moved across the network transparently. It is preferable to use multiple gigabit network interface cards on both the initiator and the target, and to bond them together for reliability and extra throughput (or use Device Mapper Multipath). A separate VLAN for iSCSI traffic is recommended for both security and speed: by default the traffic is not encrypted, so your disk blocks can easily be sniffed with tcpdump.
I created identical logical volumes on both OpenFiler servers and mapped them to iSCSI targets. The iSCSI initiator (client) here is an Ubuntu 9.04 desktop.
Install Open-iSCSI and map targets
On the client, install Open-iSCSI.
aptitude install open-iscsi
Run the discovery to see available targets (the IP address is the address of one of the servers).
iscsiadm -m discovery -t st -p 192.168.1.115
You should get a target list as the output.
192.168.1.115:3260,1 iqn.2006-01.com.openfiler:linuxtest1lv
Map the target to a SCSI disk.
iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv -p 192.168.1.115 --login
dmesg should now show that a new SCSI disk was detected.
[600584.938727] scsi 2:0:0:0: Direct-Access     OPNFILER VIRTUAL-DISK    0    PQ: 0 ANSI: 4
[600584.947903] sd 2:0:0:0: [sdb] 4194304 512-byte hardware sectors: (2.14 GB/2.00 GiB)
[600584.983070] sd 2:0:0:0: [sdb] Write Protect is off
[600584.983074] sd 2:0:0:0: [sdb] Mode Sense: 77 00 00 08
[600584.988064] sd 2:0:0:0: [sdb] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[600584.989379] sd 2:0:0:0: [sdb] 4194304 512-byte hardware sectors: (2.14 GB/2.00 GiB)
[600584.989974] sd 2:0:0:0: [sdb] Write Protect is off
[600584.989977] sd 2:0:0:0: [sdb] Mode Sense: 77 00 00 08
[600584.991359] sd 2:0:0:0: [sdb] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[600584.991363] sdb: unknown partition table
[600585.008012] sd 2:0:0:0: [sdb] Attached SCSI disk
[600585.008072] sd 2:0:0:0: Attached scsi generic sg2 type 0
You can now use the disk as a normal SCSI disk.
Discover the second storage server.
iscsiadm -m discovery -t st -p 192.168.1.120
Target found:
192.168.1.120:3260,1 iqn.2006-01.com.openfiler:linuxtest1lv-2
Map the target.
iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv-2 -p 192.168.1.120 --login
Make persistent across reboots
The discovered nodes will automatically show up under /etc/iscsi/nodes. If you wish to make them available automatically after reboot, change the following line in the corresponding node file:
node.conn[0].startup = manual
Change to:
node.conn[0].startup = automatic
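The edit can be scripted instead of done by hand. The sketch below demonstrates it on a sample file, since the real node files live under /etc/iscsi/nodes/&lt;target-iqn&gt;/&lt;portal&gt;/ and require root; the file path and contents here are stand-ins.

```shell
# Demonstrate the edit on a sample node file (the real files live
# under /etc/iscsi/nodes/<target-iqn>/<portal>/ on the initiator).
tmpdir=$(mktemp -d)
cat > "$tmpdir/default" <<'EOF'
node.name = iqn.2006-01.com.openfiler:linuxtest1lv
node.conn[0].startup = manual
EOF

# Flip the startup mode from manual to automatic.
sed -i 's/^node.conn\[0\].startup = manual/node.conn[0].startup = automatic/' "$tmpdir/default"

grep 'startup' "$tmpdir/default"
```

Open-iSCSI can also update node settings through its own tool, along the lines of iscsiadm -m node -T &lt;target&gt; -p &lt;portal&gt; -o update -n node.conn[0].startup -v automatic, which avoids editing the files directly.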
Partition with fdisk (optional)
I partitioned the disks with fdisk. This is optional, but I like to do it because it makes it easier to detect the type of the disk just by checking the partition table.
Disk /dev/sdb: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Disk identifier: 0x32d429c4

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1        1009     2095662   8e  Linux LVM

Disk /dev/sdc: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Disk identifier: 0x9823ed68

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1        1009     2095662   8e  Linux LVM
The LVM Part
Install Logical Volume Manager.
aptitude install lvm2
Create physical volumes and the volume group.
pvcreate /dev/sdb1
pvcreate /dev/sdc1
vgcreate vg0 /dev/sdb1 /dev/sdc1
Create a mirrored logical volume. The --corelog option keeps the mirror log in memory instead of on a separate disk, so no third device is needed for the log; the trade-off is that the mirror is fully resynced after every reboot.
lvcreate --mirrors 1 --corelog --name testlv --size 512M vg0
Create filesystem and mount.
mke2fs -j /dev/vg0/testlv
mkdir -p /mnt/test
mount /dev/vg0/testlv /mnt/test
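If you want the volume mounted automatically at boot, keep in mind that it sits on network disks: a plain fstab entry may be processed before the network and the iSCSI sessions are up. The _netdev mount option tells the init scripts to defer the mount until networking is available. A possible /etc/fstab line (paths as used in this article):

```
# /etc/fstab -- _netdev delays mounting until the network is up
/dev/vg0/testlv  /mnt/test  ext3  defaults,_netdev  0  0
```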
Speed
Test read speeds.
hdparm -t /dev/mapper/vg0-testlv
10 MB per second is about the max I can get with this test system which uses 100 Mbit/s ethernet.
/dev/mapper/vg0-testlv:
 Timing buffered disk reads:   32 MB in  3.22 seconds =   9.95 MB/sec
On a production system, gigabit is a must (preferably multiple links bonded).
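hdparm only measures reads. For a rough sequential write figure, dd works well; the sketch below writes to a temporary file so it runs anywhere, but on the real setup you would write into the mounted volume (e.g. of=/mnt/test/ddtest) to push the data over iSCSI.

```shell
# Rough sequential write test: write 64 MB, fsync at the end so the
# data actually hits the disk, and let dd report the throughput.
testfile=$(mktemp)
out=$(dd if=/dev/zero of="$testfile" bs=1M count=64 conv=fsync 2>&1 | tail -n 1)
echo "$out"
rm -f "$testfile"
```

Note that on a mirrored volume every write goes to both iSCSI targets, so write throughput is bounded by the slower leg (and by the network).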
Status of the Mirrored Logical Volume
To check the status of the mirrored logical volume, run the command "lvs":
  LV     VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  testlv vg0  mwi-ao 512,00M                        100,00
The Copy% will show the percentage of copied extents. 100% indicates the mirrors are synced. Whenever a mirror is out-of-sync and is being updated, the percentage will be less.
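A script that waits for the mirror to finish syncing can read the percentage directly with lvs -o copy_percent. The sketch below parses a captured sample line instead of calling lvs, so it runs without an LVM setup; the sample text mirrors the output shown above.

```shell
# On a real system:  lvs --noheadings -o copy_percent vg0/testlv
# Here we parse a captured sample lvs line instead.
sample='  testlv vg0  mwi-ao 512,00M                        100,00'

# Take the last field and normalise the locale's decimal comma.
pct=$(echo "$sample" | awk '{print $NF}' | tr ',' '.')
echo "$pct"

# To keep an eye on a resync interactively, something like this works:
#   watch -n 5 'lvs vg0/testlv'
```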
The commands "lvdisplay -m" and "pvdisplay -m" will show you a detailed map of the extents on the physical volumes:
lvdisplay -m
--- Logical volume ---
LV Name /dev/vg0/testlv
VG Name vg0
LV UUID 5ookbu-qJ9h-rzBA-D6Ek-mkH2-Vryc-EYYqvp
LV Write Access read/write
LV Status available
# open 1
LV Size 512,00 MB
Current LE 128
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 252:2
--- Segments ---
Logical extent 0 to 127:
Type mirror
Mirrors 2
Mirror size 128
Mirror region size 512,00 KB
Mirror original:
Logical volume testlv_mimage_0
Logical extents 0 to 127
Mirror destinations:
Logical volume testlv_mimage_1
Logical extents 0 to 127
pvdisplay -m
  --- Physical volume ---
  PV Name               /dev/sdb1
  VG Name               vg0
  PV Size               2,00 GB / not usable 2,54 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              511
  Free PE               383
  Allocated PE          128
  PV UUID               JINpaF-WiCp-sEH2-2PcK-bEvR-ht8j-mRAg05

  --- Physical Segments ---
  Physical extent 0 to 127:
    Logical volume      /dev/vg0/testlv_mimage_0
    Logical extents     0 to 127
  Physical extent 128 to 510:
    FREE

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               vg0
  PV Size               2,00 GB / not usable 2,54 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              511
  Free PE               383
  Allocated PE          128
  PV UUID               V7dMTV-gWLe-gRWy-H7LU-7mwI-LsBu-2uIC7C

  --- Physical Segments ---
  Physical extent 0 to 127:
    Logical volume      /dev/vg0/testlv_mimage_1
    Logical extents     0 to 127
  Physical extent 128 to 510:
    FREE
Testing for failure
When one of the iSCSI target servers was brought down, it took about two minutes before the iSCSI initiator gave up on it. After that, the mounted volume kept working without problems on the surviving mirror leg. During the two-minute timeout, I/O to the volume stalled noticeably.
After the iSCSI server was brought back up, the failed half of the mirror was restored and resynced automatically. In conclusion, I would say my mirrored logical volume can be considered highly available.
It seems that the timeout value can be set in the node configuration file (although I didn't test it):
node.session.timeo.replacement_timeout = 120
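This timeout is what caused the two-minute stall above: it is how long the initiator queues I/O for an unreachable target before failing it upward, at which point LVM can drop the dead mirror leg. Lowering it should make failover faster, at the risk of dropping a leg during a brief network hiccup. The sketch below again edits a sample file, since the real node files under /etc/iscsi/nodes require root.

```shell
# Demonstrate on a sample node file; the real file lives under
# /etc/iscsi/nodes/<target-iqn>/<portal>/ on the initiator.
f=$(mktemp)
echo 'node.session.timeo.replacement_timeout = 120' > "$f"

# Drop the timeout from 120 to 30 seconds so a dead target is given
# up on (and the mirror leg failed) much sooner.
sed -i 's/^node.session.timeo.replacement_timeout = .*/node.session.timeo.replacement_timeout = 30/' "$f"
cat "$f"
```

As with the startup setting, iscsiadm's update mode (-o update -n node.session.timeo.replacement_timeout -v 30) should achieve the same thing without editing the file by hand.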