drbd.conf

Name

drbd.conf -- Configuration file for DRBD's devices

Introduction

The file /etc/drbd.conf is read by drbdadm.

The file format was designed so that a verbatim copy of the file can be kept on both nodes of the cluster. It is highly recommended to do so in order to keep your configuration manageable; /etc/drbd.conf should be identical on both nodes. Changes to /etc/drbd.conf do not apply immediately.

Example 1. A small drbd.conf file

global { usage-count yes; }
common { syncer { rate 10M; } }
resource r0 {
	protocol C;
	net {
		cram-hmac-alg sha1;
		shared-secret "FooFunFactory";
	}
	on thost1 {
		device    /dev/drbd1;
		disk      /dev/hda7;
		address   10.1.1.31:7789;
		meta-disk  internal;
	}
	on thost2 {
		device    /dev/drbd1;
		disk      /dev/hda7;
		address   10.1.1.32:7789;
		meta-disk  internal;
	}
}

In this example there is a single DRBD resource (called r0) which uses protocol C for the connection between its devices. The device which runs on host thost1 uses /dev/drbd1 as the device for its application, and /dev/hda7 as low level storage for the data. The IP addresses specify the network interfaces to use. A resync process, whenever one runs, should use about 10 MByte/second of IO bandwidth.

There may be multiple resource sections in a single drbd.conf file. For more examples please have a look at the DRBD Quickstart Guide.

File Format

The file consists of sections and parameters. A section begins with a keyword, sometimes an additional name, and an opening brace ("{"). A section ends with a closing brace ("}"). The braces enclose the parameters.

section [name] { parameter value; [...] }

A parameter starts with the identifier of the parameter followed by whitespace. Every subsequent character is considered part of the parameter's value. Boolean parameters are a special case: they consist of the identifier only. Parameters are terminated by a semicolon (";").

Some parameter values have default units which may be overridden by K, M or G. These units are defined in the usual way (K = 2^10 = 1024, M = 1024 K, G = 1024 M).

Comments may be placed into the configuration file and must begin with a hash sign ("#"). Subsequent characters are ignored until the end of the line.
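
As a sketch of these rules, the fragment below combines a comment, a parameter with a value, a Boolean parameter and a unit suffix. The chosen values are illustrative assumptions, not recommendations.

```
# A comment runs from the hash sign to the end of the line.
global {
	# Parameter with a value.
	usage-count no;
	# Boolean parameter: the identifier alone, terminated by a semicolon.
	disable-ip-verification;
}
common {
	syncer {
		# The suffix M overrides the default unit of the rate parameter.
		rate 10M;
	}
}
```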

Sections

skip

Comments out chunks of text, even spanning more than one line. Characters between the keyword skip and the opening brace ("{") are ignored. Everything enclosed by the braces is skipped. This comes in handy if you just want to comment out some 'resource [name] {...}' section: just precede it with 'skip'.
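
For illustration, the following sketch disables a second resource without deleting it. The resource r1, the devices /dev/drbd2 and /dev/hda8, and port 7790 are made-up values in the style of Example 1.

```
# The whole r1 section is ignored because of the leading skip keyword.
skip resource r1 {
	protocol C;
	on thost1 {
		device    /dev/drbd2;
		disk      /dev/hda8;
		address   10.1.1.31:7790;
		meta-disk  internal;
	}
	on thost2 {
		device    /dev/drbd2;
		disk      /dev/hda8;
		address   10.1.1.32:7790;
		meta-disk  internal;
	}
}
```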

global

Configures some global parameters. Currently only minor-count, dialog-refresh, disable-ip-verification and usage-count are allowed here. You may only have one global section, preferably as the first section.

common

All resources inherit the options set in this section. The common section may have a startup, a syncer, a handlers, a net and a disk section.

resource name

Configures a DRBD resource. Each resource section needs to have two on host sections and may have a startup, a syncer, a handlers, a net and a disk section. Required parameter in this section: protocol.

on host-name

Carries the necessary configuration parameters for a DRBD device of the enclosing resource. host-name is mandatory and must match the Linux hostname (uname -n) of one of the nodes. Required parameters in this section: device, disk, address, and either meta-disk or flexible-meta-disk.

disk

This section is used to fine tune DRBD's properties with respect to the low level storage. Please refer to drbdsetup(8) for a detailed description of the parameters. Optional parameters: on-io-error, size, fencing, use-bmbv.

net

This section is used to fine tune DRBD's properties. Please refer to drbdsetup(8) for a detailed description of this section's parameters. Optional parameters: sndbuf-size, timeout, connect-int, ping-int, ping-timeout, max-buffers, max-epoch-size, ko-count, allow-two-primaries, cram-hmac-alg, shared-secret, after-sb-0pri, after-sb-1pri, after-sb-2pri.

startup

This section is used to fine tune DRBD's properties. Please refer to drbdsetup(8) for detailed description of this section's parameters. Optional parameters: wfc-timeout, degr-wfc-timeout.

syncer

This section is used to fine tune the synchronisation daemon for the device. Please refer to drbdsetup(8) for detailed description of this section's parameters. Optional parameters: rate, after, al-extents.

handlers

In this section you can define handlers (executables) that are executed by the DRBD system in response to certain events. Optional parameters: pri-on-incon-degr, pri-lost-after-sb, pri-lost, outdate-peer, local-io-error.

Parameters

minor-count count

count may be a number from 1 to 255.

Use minor-count if you want to define massively more resources later without reloading the DRBD kernel module. By default the module loads with 11 more minors than you currently have resources in your config, but at least 32.

dialog-refresh time

time may be 0 or a positive number.

The user dialog redraws the second count every time seconds (or does no redraws if time is zero). The default is 1.

disable-ip-verification

Use disable-ip-verification if, for some obscure reason, drbdadm cannot or should not use ip or ifconfig to do a sanity check on the IP address. This option disables the check.

usage-count val

Please participate in DRBD's online usage counter. The most convenient way to do so is to set this option to yes. Valid options are: yes, no and ask.
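
A global section using the parameters described so far might look like the sketch below. The values 64 and 5 are arbitrary examples.

```
global {
	# One of yes, no or ask.
	usage-count ask;
	# Reserve minors so that more resources can be added later
	# without reloading the kernel module (1 to 255).
	minor-count 64;
	# Redraw the second count of the user dialog every 5 seconds.
	dialog-refresh 5;
}
```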

protocol prot-id

On the TCP/IP link the specified protocol is used. Valid protocol specifiers are A, B, and C.

Protocol A: write IO is reported as completed, if it has reached local disk and local TCP send buffer.

Protocol B: write IO is reported as completed, if it has reached local disk and remote buffer cache.

Protocol C: write IO is reported as completed, if it has reached both local and remote disk.

incon-degr-cmd command

In case a node starts up in degraded mode (degr-wfc-timeout is set) and its local replica of the data is inconsistent it executes the command. If the command exits without error, drbddisk expects the DRBD device to be in primary state.

device name

The name of the block device node of the resource being described. You must use this device with your application (file system) and you must not use the low level block device which is specified with the disk parameter.

The device nodes must have the same major number as the DRBD driver has. With the current implementation major 147 is used and the corresponding device nodes are usually named /dev/drbd0, /dev/drbd1, etc. (All releases before drbd-0.7.1 used major 43 and the device files /dev/nb*.)

Installation scripts of the DRBD package ensure that /dev/drbd0 to /dev/drbd8 are predefined in your system. To be sure, issue something like ls /dev/drbd*.

disk name

DRBD uses this block device to actually store and retrieve the data. Never access such a device while DRBD is running on top of it. This also holds true for dumpe2fs(8) and similar commands.

address IP:port

A resource needs one IP address per device. It is used to wait for incoming connections from the partner device and to reach the partner device.

Each DRBD resource needs a TCP port which is used to connect to the node's partner device. Two different DRBD resources may not use the same IP:port combination on the same node.

meta-disk internal, flexible-meta-disk internal, meta-disk device [index], flexible-meta-disk device

internal means that the last part of the backing device is used to store the meta-data. You must not use [index] with internal. Note: Regardless of whether you use the meta-disk or the flexible-meta-disk keyword, it will always be of the size needed for the remaining storage size.

You can use a single block device to store meta-data of multiple DRBD devices. E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1]; for two different resources. In this case the meta-disk would need to be at least 256 MB in size.
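
A minimal sketch of two resources sharing one meta-data device follows. It reuses the hosts and addresses of Example 1; the resource r1, the devices /dev/drbd2 and /dev/hda8, and port 7790 are assumptions added for illustration. Each resource gets its own index on /dev/hde6 and its own port.

```
resource r0 {
	protocol C;
	on thost1 {
		device    /dev/drbd1;
		disk      /dev/hda7;
		address   10.1.1.31:7789;
		meta-disk  /dev/hde6[0];
	}
	on thost2 {
		device    /dev/drbd1;
		disk      /dev/hda7;
		address   10.1.1.32:7789;
		meta-disk  /dev/hde6[0];
	}
}
resource r1 {
	protocol C;
	on thost1 {
		device    /dev/drbd2;
		disk      /dev/hda8;
		address   10.1.1.31:7790;
		meta-disk  /dev/hde6[1];
	}
	on thost2 {
		device    /dev/drbd2;
		disk      /dev/hda8;
		address   10.1.1.32:7790;
		meta-disk  /dev/hde6[1];
	}
}
```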

With the flexible-meta-disk keyword you specify a block device as meta-data storage. You usually use this with LVM, which allows you to have many variable sized block devices. The required size of the meta-disk block device is 36kB + Backing-Storage-size / 32k. Round this number up to the next 4kB boundary and you have the exact size. Rule of thumb: 32kByte per 1GByte of storage, rounded up to the next MB.
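
As a worked example of the size formula, assume a backing device of 100 GByte (taken as 100 * 2^30 bytes): 100 GByte / 32k = 3200 kByte, plus 36 kByte gives 3236 kByte, which is already a multiple of 4 kByte. The rule of thumb yields 3200 kByte, rounded up to 4 MByte. The logical volume names in the sketch below are hypothetical.

```
on thost1 {
	device    /dev/drbd1;
	# Hypothetical 100 GByte logical volume holding the data.
	disk      /dev/vg0/r0-data;
	address   10.1.1.31:7789;
	# Hypothetical logical volume of at least 3236 kByte for the meta-data.
	flexible-meta-disk  /dev/vg0/r0-meta;
}
```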

on-io-error handler

handler is invoked if the lower level device reports an io-error to the upper layers.

handler may be pass_on, call-local-io-error or detach.

pass_on: Report the io-error to the upper layers. On Primary report it to the mounted file system. On Secondary ignore it.

call-local-io-error: Call the handler script local-io-error.

detach: The node drops its low level device, and continues in diskless mode.
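
A minimal sketch of the call-local-io-error policy, paired with the local-io-error handler it calls, is shown below. The script path is a hypothetical placeholder, and the on host sections are left out for brevity.

```
resource r0 {
	protocol C;
	disk {
		on-io-error call-local-io-error;
	}
	handlers {
		# Hypothetical script; replace it with your own executable.
		local-io-error "/usr/local/sbin/notify-io-error.sh";
	}
	# The two on host sections are omitted here.
}
```

To continue in diskless mode instead, the disk section would contain on-io-error detach; and no handler is needed.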

fencing fencing_policy

By fencing we mean preventative measures to avoid situations where both nodes are primary and disconnected (AKA split brain).

Valid fencing policies are:

dont-care

This is the default policy. No fencing actions are undertaken.

resource-only

If a node becomes a disconnected primary it tries to outdate the peer's disk. This is done by calling the outdate-peer handler. The handler is supposed to reach the other node over alternative communication paths and call 'drbdadm outdate res' there.

resource-and-stonith

If a node becomes a disconnected primary it freezes all its IO operations and calls its outdate-peer handler. The outdate-peer handler is supposed to reach the peer over alternative communication paths and call 'drbdadm outdate res' there. In case it cannot reach the peer it should stonith the peer. IO is resumed as soon as the situation is resolved. In case your handler fails you can resume IO with the resume-io command.
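
The following sketch shows how a fencing policy and its handler might be combined. The handler path is a hypothetical script, not a program shipped with DRBD, and the on host sections are left out for brevity.

```
resource r0 {
	protocol C;
	disk {
		fencing resource-only;
	}
	handlers {
		# Hypothetical script that reaches the peer over another
		# communication path and runs 'drbdadm outdate r0' there.
		outdate-peer "/usr/local/sbin/outdate-peer.sh";
	}
	# The two on host sections are omitted here.
}
```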

use-bmbv

In case the backing storage's driver has a merge_bvec_fn() function (at the time of writing the only known drivers which have such a function are: md (software raid driver), dm (device mapper - LVM) and DRBD itself), DRBD has to pretend that it can only process IO requests in units not larger than 4kByte.

To get best performance out of DRBD on top of software raid (or any other driver with a merge_bvec_fn() function) you might enable this function, iff you know for sure that the merge_bvec_fn() function will deliver the same results on all nodes of your cluster. I.e. the physical disks of the software raid are of the exact same type. USE THIS OPTION ONLY IF YOU KNOW WHAT YOU ARE DOING.

sndbuf-size size

size is the size of the TCP socket send buffer. Default is 128K. You can specify smaller or larger values. Larger values are appropriate for reasonable write throughput with protocol A over high latency networks. Very large values like 1M may cause problems. Values below 32K do not make much sense either.
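
As an illustration, a resource replicating with protocol A over a high latency link might enlarge the send buffer as sketched below. The value 512K is an assumption chosen to lie between the default and the 1M mentioned above; the on host sections are left out for brevity.

```
resource r0 {
	protocol A;
	net {
		# Larger than the default of 128K, well below 1M.
		sndbuf-size 512K;
	}
	# The two on host sections are omitted here.
}
```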

timeout time

If the partner node fails to send an expected response packet within time tenths of a second, the partner node is considered dead and therefore the TCP/IP connection is abandoned. This must be lower than connect-int and ping-int. The default value is 60 (= 6 seconds); the unit is 0.1 seconds.

connect-int time

In case it is not possible to connect to the remote DRBD device immediately, DRBD keeps on trying to connect. With this option you can set the time between two tries. The default value is 10 seconds, the unit is 1 second.

ping-int time

If the TCP/IP connection linking a DRBD device pair is idle for more than time seconds, DRBD will generate a keep-alive packet to check if its partner is still alive. The default is 10 seconds, the unit is 1 second.

ping-timeout time

The time the peer has to answer a keep-alive packet. If the peer's reply is not received within this time period, the peer is considered dead. The default is 500ms; the unit is 100ms, so the default corresponds to a value of 5.
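
Because these four parameters use different units, a sketch that spells out the documented defaults may help. Note that timeout (6 seconds) stays below both connect-int and ping-int (10 seconds each), as required.

```
net {
	# Unit 0.1 seconds: 60 = 6 seconds.
	timeout       60;
	# Unit 1 second.
	connect-int   10;
	# Unit 1 second.
	ping-int      10;
	# Unit 100ms: 5 = 500ms.
	ping-timeout  5;
}
```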

max-buffers number

Maximum number of requests to be allocated by DRBD. Unit is PAGE_SIZE, which is 4 KB on most systems. The minimum is hardcoded to 32 (= 128 KB). For high performance installations it might help to increase that number. These buffers are used to hold data blocks while they are written to disk.

ko-count number

In case the secondary node fails to complete a single write request within number times the timeout, it is expelled from the cluster. (I.e. the primary node goes into StandAlone mode.) The default is 0, which disables this feature.

max-epoch-size number

The highest number of data blocks between two write barriers. If you set this smaller than 10 you might decrease your performance.

allow-two-primaries

With this option set you may make both nodes primary. You should only use this option if you use a shared storage file system on top of DRBD. At the time of writing the only ones are OCFS2 and GFS. If you use this option with any other file system you are going to crash your nodes and corrupt your data!

unplug-watermark number

When the number of pending write requests on the standby (secondary) node exceeds the unplug-watermark, we trigger the request processing of our backing storage device. Some storage controllers deliver better performance with small values, others deliver best performance when it is set to the same value as max-buffers. Minimum 16, default 128, maximum 131072.

cram-hmac-alg

You need to specify the HMAC algorithm to enable peer authentication at all. It is strongly encouraged to use peer authentication. The HMAC algorithm is used for the challenge-response authentication of the peer. You may specify any digest algorithm that is named in /proc/crypto.

shared-secret

The shared secret used in peer authentication. May be up to 64 characters.

after-sb-0pri policy

Possible policies are:

disconnect

No automatic resynchronisation, simply disconnect.

discard-younger-primary

Auto sync from the node that was primary before the split brain situation happened.

discard-older-primary

Auto sync from the node that became primary second during the split brain situation.

discard-zero-changes

In case one node did not write anything since the split brain became evident, sync from the node that wrote something to the node that did not write anything. In case neither wrote anything, this policy uses a random decision to perform a "resync" of 0 blocks. In case both have written something, this policy disconnects the nodes.

discard-least-changes

Auto sync from the node that touched more blocks during the split brain situation.

discard-node-NODENAME

Auto sync to the named node.

after-sb-1pri policy

Possible policies are:

disconnect

No automatic resynchronisation, simply disconnect.

consensus

Discard the version of the secondary if the outcome of the after-sb-0pri algorithm would also destroy the current secondary's data. Otherwise disconnect.

violently-as0p

Always take the decision of the after-sb-0pri algorithm, even if that causes an erratic change of the primary's view of the data. This is only useful if you use a single-node FS (i.e. not OCFS2 or GFS) with the allow-two-primaries flag, _AND_ you really know what you are doing. This is DANGEROUS and MAY CRASH YOUR MACHINE if you have a FS mounted on the primary node.

discard-secondary

Discard the secondary's version.

call-pri-lost-after-sb

Always honour the outcome of the after-sb-0pri algorithm. In case it decides that the current secondary has the right data, it calls the "pri-lost-after-sb" handler on the current primary.

after-sb-2pri policy

Possible policies are:

disconnect

No automatic resynchronisation, simply disconnect.

violently-as0p

Always take the decision of the after-sb-0pri algorithm, even if that causes an erratic change of the primary's view of the data. This is only useful if you use a single-node FS (i.e. not OCFS2 or GFS) with the allow-two-primaries flag, _AND_ you really know what you are doing. This is DANGEROUS and MAY CRASH YOUR MACHINE if you have a FS mounted on the primary node.

call-pri-lost-after-sb

Call the "pri-lost-after-sb" helper program on one of the machines. This program is expected to reboot the machine. (I.e. make it secondary.)

always-asbp

Normally the automatic after-split-brain policies are only used if the current state of the UUIDs does not indicate the presence of a third node.

With this option you request that the automatic after-split-brain policies are used as long as the data sets of the nodes are somehow related. This might cause a full sync if the UUIDs indicate the presence of a third node (or if double faults led to strange UUID sets).

rr-conflict policy

This option resolves the cases in which the outcome of the resync decision is incompatible with the current role assignment in the cluster.

disconnect

No automatic resynchronisation, simply disconnect.

violently

Sync to the primary node is allowed, violating the assumption that data on a block device is stable for one of the nodes. DANGEROUS, DO NOT USE.

call-pri-lost

Call the "pri-lost" helper program on one of the machines. This program is expected to reboot the machine. (I.e. make it secondary.)
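
The split brain recovery parameters all belong in the net section. The sketch below shows one possible, fairly conservative combination; the choice of policies is an assumption made for illustration and is not a recommendation from this manual.

```
net {
	# No primaries: resync automatically only if one side wrote nothing.
	after-sb-0pri discard-zero-changes;
	# One primary: discard the secondary only if after-sb-0pri agrees.
	after-sb-1pri consensus;
	# Two primaries: never resolve automatically.
	after-sb-2pri disconnect;
	# Resync direction conflicts with the current roles: disconnect.
	rr-conflict   disconnect;
}
```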

wfc-timeout time

Wait for connection timeout. The init script drbd(8) blocks the boot process until the DRBD resources are connected. This is so when the cluster manager starts later, it does not see a resource with internal split-brain. In case you want to limit the wait time, do it here. Default is 0, which means unlimited. Unit is seconds.

degr-wfc-timeout time

Wait for connection timeout, if this node was part of a degraded cluster. In case a degraded cluster (= cluster with only one node left) is rebooted, this timeout value is used instead of wfc-timeout, because the peer is less likely to show up in time if it had been dead before. Default is 60, unit is seconds. Value 0 means unlimited.
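
A startup section that limits both waiting periods might look like the following sketch. The value 300 is an arbitrary example; 60 is the documented default for degr-wfc-timeout.

```
startup {
	# Wait at most 300 seconds for the peer during a normal boot.
	wfc-timeout      300;
	# Wait at most 60 seconds if the cluster was already degraded.
	degr-wfc-timeout 60;
}
```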

rate rate

To ensure smooth operation of the application on top of DRBD, it is possible to limit the bandwidth which may be used by background synchronizations. The default is 250 KB/sec, the default unit is KB/sec. Optional suffixes K, M, G are allowed.

after res-name

By default, resynchronization of all devices runs in parallel. By defining a sync-after dependency, the resynchronization of this resource will start only if the resource res-name is already in connected state (= finished its resynchronization).

al-extents extents

DRBD automatically performs hot area detection. With this parameter you control how big the hot area (= active set) can get. Each extent marks 4M of the backing storage (= low level device). In case a primary node leaves the cluster unexpectedly, the areas covered by the active set must be resynced upon rejoin of the failed node. The data structure is stored in the meta-data area, therefore each change of the active set is a write operation to the meta-data device. A higher number of extents gives longer resync times but fewer updates to the meta-data. The default number of extents is 127. (Minimum: 7, Maximum: 3843)
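
The three syncer parameters can be combined as in the sketch below, which assumes a second resource r1 that should resync only after r0. With 257 extents of 4M each, the active set covers up to 1028 MByte. The values are illustrative assumptions, and the on host sections are left out for brevity.

```
resource r1 {
	protocol C;
	syncer {
		# Limit background resynchronization to 10 MByte/second.
		rate       10M;
		# Start resyncing r1 only once r0 is in connected state.
		after      "r0";
		# 257 extents * 4M = 1028 MByte of active set.
		al-extents 257;
	}
	# The two on host sections are omitted here.
}
```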

pri-on-incon-degr cmd

This handler is called if the node is primary, degraded and the local copy of the data is inconsistent.

pri-lost-after-sb cmd

The node is currently primary, but lost the after split brain auto recovery procedure. As a consequence it should go away.

pri-lost cmd

The node is currently primary, but DRBD's algorithm thinks that it should become sync target. As a consequence it should give up its primary state.

outdate-peer cmd

The handler is part of the fencing mechanism. This handler is called in case the node needs to outdate the peer's disk. It should use communication paths other than DRBD's network link.

local-io-error cmd

DRBD got an IO error from the local IO subsystem.
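
A handlers section is a list of event names, each followed by a quoted command. In the sketch below the command halt -f is an assumption chosen for illustration; pick commands that suit your cluster manager.

```
handlers {
	# Primary, degraded and inconsistent: take the node down.
	pri-on-incon-degr "halt -f";
	# Primary lost the after split brain auto recovery: go away.
	pri-lost-after-sb "halt -f";
	# Primary should become sync target: give up the primary state.
	pri-lost          "halt -f";
}
```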

Version

This document was revised for version 8.0 of the DRBD distribution.

Author

Written by Philipp Reisner and Lars Ellenberg.

Reporting Bugs

Report bugs to .

Copyright

Copyright 2001-2007 LINBIT Information Technologies, Philipp Reisner, Lars Ellenberg. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See Also

drbd(8), drbddisk(8), drbdsetup(8), drbdadm(8), DRBD Homepage