Mikko Kortelainen

Redundant iSCSI storage for Linux

Here's how to set up relatively cheap redundant iSCSI storage on Linux. The redundancy is achieved using LVM mirroring, and the storage servers consist of commodity hardware, running the OpenFiler Linux distribution, which expose their disks to the clients using iSCSI over Ethernet. The servers are completely separate entities, and the purpose of this mirroring is to keep the logical volumes available, even while one of the storage servers is down for maintenance or due to hardware failure.

Ultimately the disks of the iSCSI target servers will show up as normal SCSI disks on the client (/dev/sdb, /dev/sdc, ...). The data is moved across the network transparently. It is preferable to use multiple gigabit network interface cards on both the initiator and the target, and bond them together for reliability and extra throughput (or use Device Mapper Multipath). A separate VLAN for iSCSI traffic is recommended for both security and performance. By default the traffic is not encrypted, so your disk blocks can easily be sniffed with tcpdump.
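
On Debian/Ubuntu, bonding can be configured in /etc/network/interfaces. The following is a minimal sketch, assuming the ifenslave package is installed; the interface names (eth1, eth2) and the IP address are hypothetical, and the exact stanza names vary between ifenslave versions, so check the package documentation for your release.

```
# /etc/network/interfaces -- bond two NICs dedicated to the iSCSI VLAN.
auto bond0
iface bond0 inet static
    address 192.168.1.50
    netmask 255.255.255.0
    bond-slaves eth1 eth2
    bond-mode balance-rr
    bond-miimon 100
```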

I created identical logical volumes on both OpenFiler servers and mapped them to iSCSI targets. The iSCSI initiator (client) here is an Ubuntu 9.04 desktop.

Install Open-iSCSI and map targets

On the client, install Open-iSCSI.

aptitude install open-iscsi

Run the discovery to see available targets (the IP address is the address of one of the servers).

iscsiadm -m discovery -t st -p 192.168.1.115

You should get a target list as the output.

192.168.1.115:3260,1 iqn.2006-01.com.openfiler:linuxtest1lv

Map the target to a SCSI disk.

iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv -p 192.168.1.115 --login

dmesg should now show that a new SCSI disk was detected.

[600584.938727] scsi 2:0:0:0: Direct-Access     OPNFILER VIRTUAL-DISK     0    PQ: 0 ANSI: 4
[600584.947903] sd 2:0:0:0: [sdb] 4194304 512-byte hardware sectors: (2.14 GB/2.00 GiB)
[600584.983070] sd 2:0:0:0: [sdb] Write Protect is off
[600584.983074] sd 2:0:0:0: [sdb] Mode Sense: 77 00 00 08
[600584.988064] sd 2:0:0:0: [sdb] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[600584.989379] sd 2:0:0:0: [sdb] 4194304 512-byte hardware sectors: (2.14 GB/2.00 GiB)
[600584.989974] sd 2:0:0:0: [sdb] Write Protect is off
[600584.989977] sd 2:0:0:0: [sdb] Mode Sense: 77 00 00 08
[600584.991359] sd 2:0:0:0: [sdb] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[600584.991363]  sdb: unknown partition table
[600585.008012] sd 2:0:0:0: [sdb] Attached SCSI disk
[600585.008072] sd 2:0:0:0: Attached scsi generic sg2 type 0

You can now use the disk as a normal SCSI disk.

Discover the second storage server.

iscsiadm -m discovery -t st -p 192.168.1.120

Target found:

192.168.1.120:3260,1 iqn.2006-01.com.openfiler:linuxtest1lv-2

Map the target.

iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv-2 -p 192.168.1.120 --login

Make persistent across reboots

The discovered nodes will automatically show up under /etc/iscsi/nodes. If you wish to make them available automatically after reboot, change the following line in the corresponding node file:

node.conn[0].startup = manual

Change to:

node.conn[0].startup = automatic
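
Alternatively, the same setting can be changed with iscsiadm instead of editing the node file by hand. A sketch, using the first target from above:

```shell
# -o update modifies the stored node record; -n names the parameter
# and -v gives its new value.
iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv \
  -p 192.168.1.115 -o update \
  -n node.conn[0].startup -v automatic
```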

Partition with fdisk (optional)

I partitioned the disks with fdisk. This is optional, but I like to do it because it makes it easier to identify the type of a disk just by looking at its partition table.

Disk /dev/sdb: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Disk identifier: 0x32d429c4

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1        1009     2095662   8e  Linux LVM

Disk /dev/sdc: 2147 MB, 2147483648 bytes
67 heads, 62 sectors/track, 1009 cylinders
Units = cylinders of 4154 * 512 = 2126848 bytes
Disk identifier: 0x9823ed68

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1        1009     2095662   8e  Linux LVM
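
The same single-partition layout can also be created non-interactively with sfdisk, which is handy when repeating it on several disks. A sketch; double-check the device names first, since this overwrites the partition table:

```shell
# Create one full-disk partition of type 8e (Linux LVM) on each iSCSI disk.
echo ',,8e' | sfdisk /dev/sdb
echo ',,8e' | sfdisk /dev/sdc
```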

The LVM Part

Install Logical Volume Manager.

aptitude install lvm2

Create physical volumes and the volume group.

pvcreate /dev/sdb1
pvcreate /dev/sdc1
vgcreate vg0 /dev/sdb1 /dev/sdc1

Create a mirrored logical volume. The --corelog option keeps the mirror log in memory instead of on a third device; the trade-off is that the mirror must be fully resynced after every reboot.

lvcreate --mirrors 1 --corelog --name testlv --size 512M vg0

Create filesystem and mount.

mke2fs -j /dev/vg0/testlv
mount /dev/vg0/testlv /mnt/test
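
If you add the filesystem to /etc/fstab, use the _netdev mount option so the system waits for the network (and thus the iSCSI session) before trying to mount it. A sketch:

```
# /etc/fstab
/dev/vg0/testlv  /mnt/test  ext3  defaults,_netdev  0  2
```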

Speed

Test read speeds.

hdparm -t /dev/mapper/vg0-testlv

10 MB per second is about the maximum on this test system, which uses 100 Mbit/s Ethernet: the raw link tops out at 12.5 MB/s, and protocol overhead accounts for the rest.

/dev/mapper/vg0-testlv:
 Timing buffered disk reads:   32 MB in  3.22 seconds =   9.95 MB/sec

On a production system, gigabit is a must (preferably multiple links bonded).

Status of the Mirrored Logical Volume

To check the status of the mirrored logical volume, run the command "lvs":

LV     VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
testlv vg0  mwi-ao 512,00M                        100,00

The Copy% column shows the percentage of synchronized extents: 100% means the mirror halves are in sync, while a lower value means a mirror is out of sync and being resynchronized.

The commands "lvdisplay -m" and "pvdisplay -m" will show you a detailed map of the extents on the physical volumes:

lvdisplay -m

--- Logical volume ---
LV Name                /dev/vg0/testlv
VG Name                vg0
LV UUID                5ookbu-qJ9h-rzBA-D6Ek-mkH2-Vryc-EYYqvp
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                512,00 MB
Current LE             128
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           252:2

--- Segments ---
Logical extent 0 to 127:
  Type        mirror
  Mirrors     2
  Mirror size     128
  Mirror region size  512,00 KB
  Mirror original:
    Logical volume    testlv_mimage_0
    Logical extents   0 to 127
  Mirror destinations:
    Logical volume    testlv_mimage_1
    Logical extents   0 to 127

pvdisplay -m

--- Physical volume ---
PV Name               /dev/sdb1
VG Name               vg0
PV Size               2,00 GB / not usable 2,54 MB
Allocatable           yes
PE Size (KByte)       4096
Total PE              511
Free PE               383
Allocated PE          128
PV UUID               JINpaF-WiCp-sEH2-2PcK-bEvR-ht8j-mRAg05

--- Physical Segments ---
Physical extent 0 to 127:
  Logical volume  /dev/vg0/testlv_mimage_0
  Logical extents 0 to 127
Physical extent 128 to 510:
  FREE

--- Physical volume ---
PV Name               /dev/sdc1
VG Name               vg0
PV Size               2,00 GB / not usable 2,54 MB
Allocatable           yes
PE Size (KByte)       4096
Total PE              511
Free PE               383
Allocated PE          128
PV UUID               V7dMTV-gWLe-gRWy-H7LU-7mwI-LsBu-2uIC7C

--- Physical Segments ---
Physical extent 0 to 127:
  Logical volume  /dev/vg0/testlv_mimage_1
  Logical extents 0 to 127
Physical extent 128 to 510:
  FREE

Testing for failure

When one of the iSCSI servers was brought down, it took about two minutes before the iSCSI initiator gave up on it. After that, the mounted volume kept working without problems. During the two-minute timeout, though, I/O stalled and the system felt sluggish.

After the iSCSI server was brought back up, the missing half of the mirror was restored and synced automatically. In conclusion, I would say the mirrored logical volume can be considered highly available.

It seems that the timeout value can be set in the node configuration file (although I didn't test it):

node.session.timeo.replacement_timeout = 120
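
The value is in seconds, so the default of 120 matches the roughly two-minute delay observed above. If that feels too long, the stored node record can be updated with iscsiadm, for example (untested, like the setting itself):

```shell
# Shorten the replacement timeout of the recorded node to 30 seconds.
iscsiadm -m node -T iqn.2006-01.com.openfiler:linuxtest1lv \
  -p 192.168.1.115 -o update \
  -n node.session.timeo.replacement_timeout -v 30
```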