The file /etc/drbd.conf is read by
drbdadm.
The file format was designed to allow
a verbatim copy of the file on both nodes of the cluster.
It is highly recommended to do so in order to keep your configuration
manageable. Changes to /etc/drbd.conf do not apply
immediately.
Example 1. A small drbd.conf file
global { usage-count yes; }
common { syncer { rate 10M; } }
resource r0 {
    protocol C;
    net {
        cram-hmac-alg sha1;
        shared-secret "FooFunFactory";
    }
    on thost1 {
        device /dev/drbd1;
        disk /dev/hda7;
        address 10.1.1.31:7789;
        meta-disk internal;
    }
    on thost2 {
        device /dev/drbd1;
        disk /dev/hda7;
        address 10.1.1.32:7789;
        meta-disk internal;
    }
}
There may be multiple resource sections in a single drbd.conf file. For more examples please have a look at the DRBD Quickstart Guide.
The file consists of sections and parameters. A section begins with a keyword, sometimes an additional name, and an opening brace ("{"). A section ends with a closing brace ("}"). The braces enclose the parameters.
section [name] { parameter value; [...] }
A parameter starts with the identifier of the parameter followed by whitespace. Every subsequent character is considered part of the parameter's value. Boolean parameters are a special case: they consist of the identifier only. Parameters are terminated by a semicolon (";").
Some parameter values have default units which can be overridden by K, M or G. These units are defined in the usual way (K = 2^10 = 1024, M = 1024 K, G = 1024 M).
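The suffix rule above can be sketched in a few lines of Python. This is only an illustration, not drbdadm's parser: the function name to_bytes is hypothetical, and it assumes that a suffixed value is a byte count while an unsuffixed value is counted in the parameter's default unit.

```python
# Multipliers as defined in the text: K = 2^10, M = 1024 K, G = 1024 M.
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(text, default_unit_bytes):
    """Convert a value such as '10M' or '250' to bytes.

    Assumption: a K/M/G suffix replaces the parameter's default unit,
    and a bare number is counted in default_unit_bytes.
    """
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text) * default_unit_bytes


# 'rate 10M' with a default unit of KB (1024 bytes):
print(to_bytes("10M", 1024))
# 'rate 250' falls back to the default unit:
print(to_bytes("250", 1024))
```

For example, with a default unit of 1024 bytes, "10M" yields 10485760 and "250" yields 256000.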
Comments may be placed into the configuration file and must begin with a hash sign ("#"). Subsequent characters are ignored until the end of the line.
skip
Comments out chunks of text, even spanning more than one line.
Characters between the keyword skip and the opening
brace ("{") are ignored. Everything enclosed by the braces
is skipped.
This comes in handy if you just want to comment out
some 'resource [name] {...}' section: just precede it with 'skip'.
global
Configures some global parameters. Currently only
minor-count, dialog-refresh,
disable-ip-verification and usage-count
are allowed here. You may only have one global section, preferably
as the first section.
common
All resources inherit the options set in this section.
The common section might have
a startup,
a syncer,
a handlers,
a net and a disk section.
resource name
Configures a DRBD resource.
Each resource section needs to have two
on host sections
and may have
a startup,
a syncer,
a handlers,
a net and a disk section.
Required parameter in this section: protocol.
on host-name
Carries the necessary configuration parameters for a DRBD
device of the enclosing resource.
host-name is mandatory and must match the
Linux hostname (uname -n) of one of the nodes.
Required parameters in this section: device,
disk, address, and either meta-disk or
flexible-meta-disk.
disk
This section is used to fine tune DRBD's properties
with respect to the low level storage. Please
refer to drbdsetup(8) for a detailed description of
the parameters.
Optional parameters: on-io-error,
size, fencing, use-bmbv.
net
This section is used to fine tune DRBD's properties. Please
refer to drbdsetup(8) for detailed description
of this section's parameters.
Optional parameters:
sndbuf-size, timeout,
connect-int, ping-int,
ping-timeout,
max-buffers, max-epoch-size,
ko-count, allow-two-primaries,
cram-hmac-alg, shared-secret,
after-sb-0pri, after-sb-1pri,
after-sb-2pri
startup
This section is used to fine tune DRBD's properties. Please
refer to drbdsetup(8) for detailed description
of this section's parameters.
Optional parameters:
wfc-timeout, degr-wfc-timeout.
syncer
This section is used to fine tune the synchronisation daemon
for the device. Please
refer to drbdsetup(8) for detailed description
of this section's parameters.
Optional parameters:
rate, after, al-extents.
handlers
In this section you can define handlers (executables) that are executed
by the DRBD system in response to certain events.
Optional parameters:
pri-on-incon-degr, pri-lost-after-sb,
pri-lost, outdate-peer,
local-io-error.
minor-count count
count may be a number from 1 to 255.
Use minor-count if you want to define many more resources later without reloading the DRBD kernel module. By default the module loads with 11 more minors than you currently have in your config, but at least 32.
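The default described above amounts to a simple maximum. A minimal sketch, assuming the rule is exactly "configured resources plus 11, but never fewer than 32" (the function name is hypothetical):

```python
def default_minor_count(configured_resources):
    """Number of minors the module loads with when minor-count is not set.

    Follows the rule stated in the text: 11 more than the number of
    resources currently in the config, but at least 32.
    """
    return max(configured_resources + 11, 32)


for n in (3, 21, 30):
    print(n, "->", default_minor_count(n))
```

With 3 or 21 configured resources the floor of 32 applies; with 30 resources the result is 41.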
dialog-refresh time
time may be 0 or a positive number.
The user dialog redraws the second count every time seconds (or does no redraws if time is zero). The default is 1.
disable-ip-verification
Use disable-ip-verification if, for some obscure reason, drbdadm cannot use ip or ifconfig to do a sanity check of the IP address. This option disables that check.
usage-count val
Please participate in
DRBD's online usage counter.
The most convenient way to do so
is to set this option to yes. Valid options are:
yes, no and ask.
protocol prot-id
On the TCP/IP link the specified protocol is used. Valid protocol specifiers are A, B, and C.
Protocol A: write IO is reported as completed when it has reached the local disk and the local TCP send buffer.
Protocol B: write IO is reported as completed when it has reached the local disk and the remote buffer cache.
Protocol C: write IO is reported as completed when it has reached both the local and the remote disk.
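The three completion rules can be compared side by side in a small Python model. This is a sketch for illustration only; the stage names and the function write_completed are made up here and do not correspond to anything in DRBD's code.

```python
# Stages a write must have reached before it is reported as completed,
# per protocol, as described in the text. Stage names are illustrative.
REQUIRED_STAGES = {
    "A": {"local_disk", "local_tcp_send_buffer"},
    "B": {"local_disk", "remote_buffer_cache"},
    "C": {"local_disk", "remote_disk"},
}


def write_completed(protocol, reached):
    """Return True if every stage required by the protocol was reached."""
    return REQUIRED_STAGES[protocol] <= set(reached)


# A write that is on the local disk and in the peer's buffer cache,
# but not yet on the peer's disk:
reached = {"local_disk", "local_tcp_send_buffer", "remote_buffer_cache"}
for proto in "ABC":
    print(proto, write_completed(proto, reached))
```

In this model the same write counts as completed under protocols A and B, but not yet under protocol C.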
incon-degr-cmd command
In case a node starts up in degraded mode (degr-wfc-timeout is set) and its local replica of the data is inconsistent, it executes the command. If the command exits without error, drbddisk expects the DRBD device to be in primary state.
device name
The name of the block device node of the resource being described.
You must use this device with your application (file system) and
you must not use the low level block device which is specified with the
disk parameter.
The device nodes must have the same major number as the DRBD
driver has. With the current implementation major 147 is used
and the corresponding device nodes are usually named
/dev/drbd0, /dev/drbd1, etc.
(All releases before drbd-0.7.1 used major 43 and the device
files /dev/nb*.)
The installation scripts of the DRBD package ensure that
/dev/drbd0 to /dev/drbd8 are
predefined in your system. To be sure, issue something like ls /dev/drbd*.
disk name
DRBD uses this block device to actually store and retrieve the data. Never access such a device while DRBD is running on top of it. This also holds true for dumpe2fs(8) and similar commands.
address IP:port
A resource needs one IP address per device, which is used to wait for incoming connections from the partner device or to reach the partner device.
Each DRBD resource needs a TCP port which is used to connect to the node's partner device. Two different DRBD resources may not use the same IP:port combination on the same node.
meta-disk internal, flexible-meta-disk internal, meta-disk device [index], flexible-meta-disk device
internal means that the last part of the backing device is used to store
the meta-data. You must not use [index] with
internal. Note: Regardless of whether you use the meta-disk or
the flexible-meta-disk keyword, the internal meta-data area will always be of
the size needed for the remaining storage size.
You can use a single block device to store meta-data of multiple DRBD devices. E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1]; for two different resources. In this case the meta-disk would need to be at least 256 MB in size.
With the flexible-meta-disk keyword you specify
a block device as meta-data storage. You usually use this with LVM,
which allows you to have many variable sized block devices.
The required size of the meta-disk block device is
36kB + Backing-Storage-size / 32k. Round this number up to the next 4kB
boundary and you have the exact size.
Rule of thumb: 32kByte per 1GByte of storage, rounded up to the next
MB.
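Both the exact formula and the rule of thumb are easy to evaluate with a short script. The sketch below assumes binary units throughout (1 kB = 1024 bytes, "32k" = 32768, 1 MB = 1024 * 1024 bytes), in line with the K/M/G definitions earlier in this page; the function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def round_up(value, boundary):
    """Round value up to the next multiple of boundary (integer math)."""
    return -(-value // boundary) * boundary


def exact_meta_size(backing_bytes):
    """36 kB + backing size / 32k, rounded up to the next 4 kB boundary."""
    raw = 36 * KIB + -(-backing_bytes // (32 * KIB))
    return round_up(raw, 4 * KIB)


def rule_of_thumb_meta_size(backing_bytes):
    """32 kB per GB of backing storage, rounded up to the next MB."""
    return round_up(-(-backing_bytes // (32 * KIB)), MIB)


backing = 100 * GIB
print(exact_meta_size(backing), rule_of_thumb_meta_size(backing))
```

For a 100 GB backing device under these assumptions, the exact size is 3313664 bytes (3236 kB) and the rule of thumb gives 4 MB.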
on-io-error handler
handler is taken if the lower level device reports an IO error to the upper layers.
handler may be pass_on, call-local-io-error or detach.
pass_on: Report the IO error to the upper layers. On Primary report it to the mounted file system. On Secondary ignore it.
call-local-io-error: Call the handler script
local-io-error.
detach: The node drops its low level device, and continues in diskless mode.
fencing fencing_policy
By fencing we mean preventive
measures to avoid situations where both nodes are primary
and disconnected (AKA split brain).
Valid fencing policies are:
dont-care
This is the default policy. No fencing actions are undertaken.
resource-only
If a node becomes a disconnected primary it tries to outdate the peer's disk. This is done by calling the outdate-peer handler. The handler is supposed to reach the other node over alternative communication paths and call 'drbdadm outdate res' there.
resource-and-stonith
If a node becomes a disconnected primary it freezes all
its IO operations and calls its outdate-peer handler. The
outdate-peer handler is supposed to reach the peer over
alternative communication paths and call 'drbdadm outdate
res' there. In case it cannot reach the peer it should
stonith the peer. IO is resumed as soon as the situation
is resolved. In case your handler fails you can resume
IO with the resume-io command.
use-bmbv
In case the backing storage's driver has a merge_bvec_fn() function (at the time of writing the only known drivers which have such a function are md (software raid driver), dm (device mapper - LVM) and DRBD itself), DRBD has to pretend that it can only process IO requests in units not larger than 4kByte.
To get the best performance out of DRBD on top of software raid (or any other driver with a merge_bvec_fn() function) you might enable this option, if and only if you know for sure that the merge_bvec_fn() function will deliver the same results on all nodes of your cluster, i.e. the physical disks of the software raid are of the exact same type. USE THIS OPTION ONLY IF YOU KNOW WHAT YOU ARE DOING.
sndbuf-size size
size is the size of the TCP socket send buffer. Default is 128K. You can specify smaller or larger values. Larger values are appropriate for reasonable write throughput with protocol A over high latency networks. Very large values like 1M may cause problems. Values below 32K do not make much sense.
timeout time
If the partner node fails to send an expected response packet within time tenths of a second, the partner node is considered dead and therefore the TCP/IP connection is abandoned. This must be lower than connect-int and ping-int. The default value is 60 (= 6 seconds); the unit is 0.1 seconds.
connect-int time
In case it is not possible to connect to the remote DRBD device immediately, DRBD keeps on trying to connect. With this option you can set the time between two tries. The default value is 10 seconds, the unit is 1 second.
ping-int time
If the TCP/IP connection linking a DRBD device pair is idle for more than time seconds, DRBD will generate a keep-alive packet to check if its partner is still alive. The default is 10 seconds, the unit is 1 second.
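Because timeout is given in tenths of a second while connect-int and ping-int are given in seconds, the requirement that timeout be lower than both is easy to get wrong. A minimal check, using the documented defaults as keyword defaults (the function name is hypothetical and this is not how drbdadm validates the file):

```python
def timeout_is_valid(timeout_tenths=60, connect_int=10, ping_int=10):
    """Check that timeout is lower than both connect-int and ping-int.

    timeout_tenths is in units of 0.1 seconds; connect_int and ping_int
    are in seconds, as described in the text.
    """
    timeout_seconds = timeout_tenths / 10
    return timeout_seconds < connect_int and timeout_seconds < ping_int


print(timeout_is_valid())                    # defaults: 6 s versus 10 s
print(timeout_is_valid(timeout_tenths=100))  # 10 s is not lower than 10 s
```

The defaults pass the check; raising timeout to 100 (10 seconds) without also raising connect-int and ping-int does not.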
ping-timeout time
The time the peer has to answer a keep-alive packet. In case the peer's reply is not received within this time period, it is considered dead. The default is 500ms (a value of 5); the unit is 100ms.
max-buffers number
Maximum number of requests to be allocated by DRBD. The unit is PAGE_SIZE, which is 4 KB on most systems. The minimum is hardcoded to 32 (= 128 KB). For high performance installations it might help to increase that number. These buffers are used to hold data blocks while they are written to disk.
ko-count number
In case the secondary node fails to complete a single write request for number times the timeout, it is expelled from the cluster (i.e. the primary node goes into StandAlone mode). The default is 0, which disables this feature.
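The effective ko-count deadline is the product of ko-count and timeout. A small sketch under the units given in this page (timeout in tenths of a second); the function name is hypothetical:

```python
def expel_after_seconds(ko_count, timeout_tenths=60):
    """Seconds a single write may stay incomplete before the peer is expelled.

    Returns None when ko_count is 0, which disables the feature.
    timeout_tenths is the timeout parameter, in units of 0.1 seconds.
    """
    if ko_count == 0:
        return None
    return ko_count * timeout_tenths / 10


print(expel_after_seconds(0))      # feature disabled
print(expel_after_seconds(4))      # 4 * 6 seconds with the default timeout
print(expel_after_seconds(7, 30))  # 7 * 3 seconds
```

With the default timeout of 60, a ko-count of 4 corresponds to 24 seconds.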
max-epoch-size number
The highest number of data blocks between two write barriers. If you set this smaller than 10 you might decrease your performance.
allow-two-primaries
With this option set you may make both nodes primary. You should only use this option if you use a shared storage file system on top of DRBD. At the time of writing the only ones are OCFS2 and GFS. If you use this option with any other file system you are going to crash your nodes and corrupt your data!
unplug-watermark number
When the number of pending write requests on the standby (secondary) node exceeds the unplug-watermark, we trigger the request processing of our backing storage device. Some storage controllers deliver better performance with small values, others deliver best performance when it is set to the same value as max-buffers. Minimum 16, default 128, maximum 131072.
cram-hmac-alg
You need to specify the HMAC algorithm to enable peer authentication at all. It is strongly encouraged to use peer authentication. This is the HMAC algorithm which will be used for the challenge response authentication of the peer. You may specify any digest algorithm that is named in /proc/crypto.
shared-secret
The shared secret used in peer authentication. May be up to 64 characters.
after-sb-0pri policy
Possible policies are:
disconnect
No automatic resynchronisation, simply disconnect.
discard-younger-primary
Auto sync from the node that was primary before the split brain situation happened.
discard-older-primary
Auto sync from the node that became primary second during the split brain situation.
discard-zero-changes
In case one node did not write anything since the split brain became evident, sync from the node that wrote something to the node that did not write anything. In case neither wrote anything this policy uses a random decision to perform a "resync" of 0 blocks. In case both have written something this policy disconnects the nodes.
discard-least-changes
Auto sync from the node that touched more blocks during the split brain situation.
discard-node-NODENAME
Auto sync to the named node.
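The discard-zero-changes policy described above is effectively a three-way decision. The following Python sketch models only that decision as the text states it; the node labels "a" and "b" and the function name are illustrative, and the real implementation works on DRBD's own state, not on booleans.

```python
import random


def discard_zero_changes(a_wrote, b_wrote):
    """Return the sync source ('a' or 'b') or 'disconnect'.

    a_wrote / b_wrote: whether each node wrote anything since the
    split brain became evident.
    """
    if a_wrote and b_wrote:
        return "disconnect"           # both changed data: no automatic choice
    if a_wrote:
        return "a"                    # sync from the node that wrote something
    if b_wrote:
        return "b"
    return random.choice(["a", "b"])  # nobody wrote: a "resync" of 0 blocks


print(discard_zero_changes(True, False))
print(discard_zero_changes(True, True))
```

When only one node has written, that node becomes the sync source; when both have written, the nodes disconnect and the split brain must be resolved by other means.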
after-sb-1pri policy
Possible policies are:
disconnect
No automatic resynchronisation, simply disconnect.
consensus
Discard the version of the secondary if the outcome
of the after-sb-0pri algorithm would also
destroy the current secondary's data. Otherwise disconnect.
violently-as0p
Always take the decision of the after-sb-0pri
algorithm, even if that causes an erratic change of
the primary's view of the data. This is only useful if
you use a single-node FS (i.e. not OCFS2 or GFS) with the
allow-two-primaries flag, _AND_ you really know what you
are doing. This is DANGEROUS and MAY CRASH YOUR MACHINE
if you have a FS mounted on the primary node.
discard-secondary
Discard the secondary's version.
call-pri-lost-after-sb
Always honour the outcome of the after-sb-0pri
algorithm. In case it decides that the current
secondary has the right data, it calls the "pri-lost-after-sb"
handler on the current primary.
after-sb-2pri policy
Possible policies are:
disconnect
No automatic resynchronisation, simply disconnect.
violently-as0p
Always take the decision of the after-sb-0pri
algorithm, even if that causes an erratic change of
the primary's view of the data. This is only useful if
you use a single-node FS (i.e. not OCFS2 or GFS) with the
allow-two-primaries flag, _AND_ you really know what you
are doing. This is DANGEROUS and MAY CRASH YOUR MACHINE
if you have a FS mounted on the primary node.
call-pri-lost-after-sb
Call the "pri-lost-after-sb" helper program on one of the machines. This program is expected to reboot the machine (i.e. make it secondary).
always-asbp
Normally the automatic after-split-brain policies are only used if the current state of the UUIDs does not indicate the presence of a third node.
With this option you request that the automatic after-split-brain policies are used as long as the data sets of the nodes are somehow related. This might cause a full sync if the UUIDs indicate the presence of a third node (or if double faults led to strange UUID sets).
rr-conflict policy
This option resolves the cases where the outcome of the resync decision is incompatible with the current role assignment in the cluster.
disconnect
No automatic resynchronisation, simply disconnect.
violently
Sync to the primary node is allowed, violating the assumption that data on a block device is stable for one of the nodes. DANGEROUS, DO NOT USE.
call-pri-lost
Call the "pri-lost" helper program on one of the machines. This program is expected to reboot the machine (i.e. make it secondary).
wfc-timeout time
Wait for connection timeout. The init script drbd(8) blocks the boot process until the DRBD resources are connected. This ensures that when the cluster manager starts later, it does not see a resource with internal split-brain. In case you want to limit the wait time, do it here. Default is 0, which means unlimited. Unit is seconds.
degr-wfc-timeout time
Wait for connection timeout if this node was part of a degraded cluster. In case a degraded cluster (= cluster with only one node left) is rebooted, this timeout value is used instead of wfc-timeout, because the peer is less likely to show up in time if it had been dead before. Default is 60, unit is seconds. Value 0 means unlimited.
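The choice between the two timeouts can be summarised in a few lines. This sketch only restates the rule from the text, with the documented defaults as keyword defaults; the function name is hypothetical and None stands for "wait without limit".

```python
def effective_wait(was_degraded, wfc_timeout=0, degr_wfc_timeout=60):
    """Seconds the init script waits for the peer, or None for unlimited.

    was_degraded: True if the cluster had only one node left before reboot.
    A configured value of 0 means unlimited for either parameter.
    """
    chosen = degr_wfc_timeout if was_degraded else wfc_timeout
    return None if chosen == 0 else chosen


print(effective_wait(was_degraded=False))  # default wfc-timeout: unlimited
print(effective_wait(was_degraded=True))   # default degr-wfc-timeout: 60
```

With the defaults, a node that was part of a healthy cluster waits indefinitely, while a node from a degraded cluster gives up after 60 seconds.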
rate rate
To ensure smooth operation of the application on top of DRBD, it is possible to limit the bandwidth which may be used by background synchronizations. The default is 250 KB/sec, the default unit is KB/sec. Optional suffixes K, M, G are allowed.
after res-name
By default, resynchronization of all devices would run in parallel. By defining a sync-after dependency, the resynchronisation of this resource will start only if the resource res-name is already in connected state (= finished its resynchronisation).
al-extents extents
DRBD automatically performs hot area detection. With this parameter you control how big the hot area (= active set) can get. Each extent marks 4M of the backing storage (= low level device). In case a primary node leaves the cluster unexpectedly, the areas covered by the active set must be resynced upon rejoin of the failed node. The data structure is stored in the meta-data area, therefore each change of the active set is a write operation to the meta-data device. A higher number of extents gives longer resync times but fewer updates to the meta-data. The default number of extents is 127. (Minimum: 7, Maximum: 3843)
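To get a feel for the trade-off, the size of the active set (and so the upper bound on data to resync after a primary crash) can be computed from al-extents. A minimal sketch, assuming "4M" means 4 * 1024 * 1024 bytes; the function name is hypothetical.

```python
EXTENT_BYTES = 4 * 1024 * 1024  # each extent marks 4M of backing storage


def active_set_bytes(al_extents=127):
    """Maximum size of the hot area covered by al_extents extents."""
    if not 7 <= al_extents <= 3843:
        raise ValueError("al-extents must be between 7 and 3843")
    return al_extents * EXTENT_BYTES


for extents in (7, 127, 3843):
    print(extents, active_set_bytes(extents) // (1024 * 1024), "MB")
```

Under this assumption the default of 127 extents covers 508 MB, and the maximum of 3843 extents covers 15372 MB.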
pri-on-incon-degr cmd
This handler is called if the node is primary, degraded and the local copy of the data is inconsistent.
pri-lost-after-sb cmd
The node is currently primary, but lost the after split brain auto recovery procedure. As a consequence it should go away.
pri-lost cmd
The node is currently primary, but DRBD's algorithm thinks that it should become sync target. As a consequence it should give up its primary state.
outdate-peer cmd
The handler is part of the fencing
mechanism. This handler is called in case the node needs to outdate the
peer's disk. It should use communication paths other than DRBD's network
link.
local-io-error cmd
DRBD got an IO error from the local IO subsystem.
Written by Philipp Reisner <philipp.reisner@linbit.com>
and Lars Ellenberg <lars.ellenberg@linbit.com>.