Configuring Docker Storage on Nutanix

I have recently been looking at how best to deploy a container ecosystem on my Nutanix XCP environment. At present I can run a number of containers in a virtual machine (VM), which gives me the required networking, persistent storage and the ability to migrate between hosts, capabilities that containers themselves are only just acquiring in many cases. So that I can scale out my container deployment within my Docker host VM, I need to consider increasing the available space within /var/lib/docker. By default, if you provide no additional storage/disk for your Docker install, loopback files get created to store your containers/images. This configuration is not particularly performant, so it is not supported for production. You can see below how the default setup looks:

# docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
 Pool Name: docker-253:1-33883287-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.821 GB
 Data Space Total: 107.4 GB
 Data Space Available: 50.2 GB
 Metadata Space Used: 1.479 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.146 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-client
ID: VHCA:JO3X:IRF5:44RG:CFZ6:WETN:YBJ2:6IL5:BNDT:FK32:KH6E:UZED
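
Note the Data file and Metadata file entries above: both are loop devices backed by sparse files under /var/lib/docker. A quick way to confirm the mapping yourself (a minimal check, assuming the default paths from the docker info output):

# show which backing file each loop device is attached to
losetup -a

# show the apparent vs. actual on-disk size of the sparse data file
ls -lhs /var/lib/docker/devicemapper/devicemapper/data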

Configuring Docker Storage Options

Let's look at the various methods we can use to provide dedicated block storage for Docker containers. With the devicemapper storage driver, Docker automatically creates a base thin device on top of two block devices, one for data and one for metadata. The thin device is automatically formatted with an empty filesystem on creation. This device is the base of all Docker images and containers: all base images are snapshots of this device, and those images are in turn used as snapshots for other images and eventually containers. This is the Docker-supported production setup. Also, by using LVM-based devices as the underlying storage you access them as raw devices and no longer go through the VFS layer.

Devicemapper: direct-lvm

To begin, create two LVM devices, one for container data and another to hold metadata. By default the loopback method creates a storage pool with 100 GB of space. In this example I am creating a 200 GB LVM volume for data and a 10 GB metadata volume. I prefer separate volumes where possible for performance reasons. We start by hot-adding the required Nutanix vDisks to the virtual machine (docker-directlvm) guest OS (Fedora 22):

<acropolis> vm.disk_create docker-directlvm create_size=200g container=DEFAULT-CTR
DiskCreate: complete
<acropolis> vm.disk_create docker-directlvm create_size=10g container=DEFAULT-CTR
DiskCreate: complete

[root@docker-directlvm ~]# lsscsi 
 
[2:0:1:0] disk NUTANIX VDISK 0 /dev/sdb 
[2:0:2:0] disk NUTANIX VDISK 0 /dev/sdc 

The next step is to create the individual LVM volumes:

# pvcreate /dev/sdb /dev/sdc
# vgcreate direct-lvm /dev/sdb /dev/sdc

# lvcreate --wipesignatures y -n data direct-lvm -l 95%VG
# lvcreate --wipesignatures y -n metadata direct-lvm -l 5%VG

If setting up a new metadata pool, you need to zero the first 4 KB to indicate empty metadata:
# dd if=/dev/zero of=/dev/direct-lvm/metadata bs=1M count=1

For sizing the metadata volume above, the rule of thumb seems to be around 0.1% of the data volume. This is somewhat anecdotal, so size with a little headroom. Next, start the Docker daemon using the required options in the file /etc/sysconfig/docker-storage:

DOCKER_STORAGE_OPTIONS="--storage-opt dm.datadev=/dev/direct-lvm/data --storage-opt \
dm.metadatadev=/dev/direct-lvm/metadata --storage-opt dm.fs=xfs"
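
On Fedora, the docker.service unit sources this file and passes $DOCKER_STORAGE_OPTIONS through to the daemon, so a restart picks up the new settings. A minimal sketch, assuming the systemd-packaged Docker on Fedora 22:

# reload unit files and restart so the daemon re-reads the storage options
systemctl daemon-reload
systemctl restart docker

# confirm the new data/metadata devices are in use
docker info | grep -E 'Data file|Metadata file'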

You can then verify that the requested underlying storage is in use with the docker info command:

# docker info
Containers: 5
Images: 2
Storage Driver: devicemapper
 Pool Name: docker-253:1-33883287-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/direct-lvm/data
 Metadata file: /dev/direct-lvm/metadata
 Data Space Used: 10.8 GB
 Data Space Total: 199.5 GB
 Data Space Available: 188.7 GB
 Metadata Space Used: 7.078 MB
 Metadata Space Total: 10.5 GB
 Metadata Space Available: 10.49 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-directlvm
ID: VHCA:JO3X:IRF5:44RG:CFZ6:WETN:YBJ2:6IL5:BNDT:FK32:KH6E:UZED

All well and good so far, but the storage options used above to point at the data and metadata devices, namely dm.datadev and dm.metadatadev, have since been deprecated. The preferred model is to have a thin pool reserved outside of Docker and passed to the daemon via the dm.thinpooldev storage option. Some Linux distros ship a helper script, docker-storage-setup, configured via /etc/sysconfig/docker-storage-setup, which does all the heavy lifting; you just need to supply a device and/or a volume group name.

Devicemapper: Thinpool

Once again, start by creating the virtual device (a Nutanix vDisk) and adding it to the virtual machine guest OS:

<acropolis> vm.disk_create docker-thinp create_size=200g container=DEFAULT-CTR
DiskCreate: complete

[root@localhost sysconfig]# lsscsi
[0:0:0:0] cd/dvd QEMU QEMU DVD-ROM 1.5. /dev/sr0
[2:0:0:0] disk NUTANIX VDISK 0 /dev/sda
[2:0:1:0] disk NUTANIX VDISK 0 /dev/sdd

Edit the file /etc/sysconfig/docker-storage-setup as follows:

[root@localhost sysconfig]# cat /etc/sysconfig/docker-storage-setup
# Edit this file to override any configuration options specified in
# /usr/lib/docker-storage-setup/docker-storage-setup.
#
# For more details refer to "man docker-storage-setup"
DEVS=/dev/sdd
VG=docker

Then create the physical volume and volume group, and run the storage helper script:

[root@docker-thinp ~]# pvcreate /dev/sdd
 Physical volume "/dev/sdd" successfully created

[root@docker-thinp ~]# vgcreate docker /dev/sdd 
 Volume group "docker" successfully created 

[root@docker-thinp ~]# docker-storage-setup 
Rounding up size to full physical extent 192.00 MiB 
Logical volume "docker-poolmeta" created. 
Wiping xfs signature on /dev/docker/docker-pool. 
Logical volume "docker-pool" created. 
WARNING: Converting logical volume docker/docker-pool and docker/docker-poolmeta to pool's data and metadata volumes. 
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.) 
Converted docker/docker-pool to thin pool. 
Logical volume "docker-pool" changed.

You can then verify the underlying storage being used in the usual way:

[root@docker-thinp ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdd 8:48 0 186.3G 0 disk
├─docker-docker--pool_tmeta 253:5 0 192M 0 lvm
│ └─docker-docker--pool 253:7 0 74.4G 0 lvm
└─docker-docker--pool_tdata 253:6 0 74.4G 0 lvm
 └─docker-docker--pool 253:7 0 74.4G 0 lvm
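
Beyond lsblk, lvs will report the thin pool's utilisation directly (a quick check, assuming the volume group is named docker as configured above):

# report thin pool usage for the docker volume group
lvs -o lv_name,lv_size,lv_attr,data_percent,metadata_percent docker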


[root@docker-thinp ~]# docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 62.39 MB
 Data Space Total: 79.92 GB
 Data Space Available: 79.86 GB
 Metadata Space Used: 90.11 kB
 Metadata Space Total: 201.3 MB
 Metadata Space Available: 201.2 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-thinp
ID: VHCA:JO3X:IRF5:44RG:CFZ6:WETN:YBJ2:6IL5:BNDT:FK32:KH6E:UZED

On completion, this will have created the correct entries in /etc/sysconfig/docker-storage:

[root@docker-thinp ~]# cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS=--storage-driver devicemapper --storage-opt dm.fs=xfs \
 --storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool

and the runtime looks like this:

[root@docker-thinp ~]# ps -ef | grep docker
root 8988 1 0 16:13 ? 00:00:11 /usr/bin/docker daemon --selinux-enabled
--storage-driver devicemapper --storage-opt dm.fs=xfs 
--storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool
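
One more production consideration: a thin pool that fills completely can wedge the daemon, so it is worth letting LVM grow the pool automatically. A hedged sketch, assuming the docker/docker-pool volume created by the helper script above (the profile name docker-thinpool is my own choice):

# create an LVM profile enabling auto-extension of the thin pool
cat > /etc/lvm/profile/docker-thinpool.profile <<'EOF'
activation {
    thin_pool_autoextend_threshold = 80
    thin_pool_autoextend_percent = 20
}
EOF

# apply the profile to the pool and check that monitoring is active
lvchange --metadataprofile docker-thinpool docker/docker-pool
lvs -o+seg_monitor docker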

Bear in mind that when you change the underlying Docker storage driver or storage options, as in the examples described above, the following destructive command sequence is typically run (be sure to back up any important data first):

$ sudo systemctl stop docker
$ sudo rm -rf /var/lib/docker 

then, once the changes are in place:

$ sudo systemctl daemon-reload
$ sudo systemctl start docker

Additional Info

http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/

https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/

http://docs.docker.com/engine/reference/commandline/daemon/#storage-driver-options

Using Ansible to deploy ELK stack on Nutanix

Just recently my colleague Andrew Nelson (@vmwnelson) posted an article on setting up Ansible on the Nutanix platform. I am also using Ansible, developing playbooks and the like to deploy the ELK stack components (Elasticsearch, Logstash, Kibana) on a block here at Nutanix. My initial aim is to set up a single index in a (single-node, for now) Elasticsearch cluster and use Logstash to pipe in data to be indexed. On top of that I intend to use Kibana and the Marvel plugin to measure the point at which my index begins to struggle, based on OS-level resource consumption and similar metrics as viewed from Marvel.

From a virtual machine perspective I have a Fedora 22 based gold image. From this base image I clone one VM to be the Ansible master from which I will run playbooks (orchestration), and another VM to which I will deploy my ELK stack. This second “target” VM has had seven vDisks added to it. The idea here is that Elasticsearch (ES) can use a comma-separated list of data paths (in my case backed by six linear LVM volumes). These are written to in a round-robin fashion by ES, so the data gets “striped”. Nutanix vDisks are already redundant, so we are getting a kind of RAID 10 for free! Here’s how my disk layout looks once configured and mounted (I am using XFS as my filesystem) on the target VM:

[root@elkhost01 ~]# df -h
/dev/mapper/esdata05-esdata05 200G 271M 200G 1% /esdata/data05
/dev/mapper/esdata03-esdata03 200G 291M 200G 1% /esdata/data03
/dev/mapper/esdata04-esdata04 200G 273M 200G 1% /esdata/data04
/dev/mapper/esdata02-esdata02 200G 271M 200G 1% /esdata/data02
/dev/mapper/esdata06-esdata06 200G 291M 200G 1% /esdata/data06
/dev/mapper/eslog-eslog 100G 150M 100G 1% /var/log/elasticsearch
/dev/mapper/esdata01-esdata01 200G 279M 200G 1% /esdata/data01

and

[root@elkhost01 ~]# lvs
 LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
 esdata01 esdata01 -wi-ao---- 200.00g
 esdata02 esdata02 -wi-ao---- 200.00g
 esdata03 esdata03 -wi-ao---- 200.00g
 esdata04 esdata04 -wi-ao---- 200.00g
 esdata05 esdata05 -wi-ao---- 200.00g
 esdata06 esdata06 -wi-ao---- 200.00g
 eslog eslog -wi-ao---- 100.00g
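
For reference, here is roughly how each of those volumes was built; a minimal sketch for a single data volume, assuming the vDisk shows up as /dev/sdb (device names will vary):

# one VG/LV pair per vDisk gives ES independent linear volumes
pvcreate /dev/sdb
vgcreate esdata01 /dev/sdb
lvcreate -n esdata01 -l 100%VG esdata01
mkfs.xfs /dev/mapper/esdata01-esdata01
mkdir -p /esdata/data01
mount -o noatime /dev/mapper/esdata01-esdata01 /esdata/data01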

The next step is to install and configure Ansible. First off, configure an ansible user on both the orchestration host and the target host and sync ssh keys between the two (there’s a module that does ssh key exchange in Ansible, which I will cover at some stage), like so:

On both VMs:

useradd ansible
passwd ansible

# generate public and private keys ...
ssh-keygen -t rsa

If using StrictModes (the default) in the sshd_config file,
then ensure the correct permissions on the .ssh directory and files:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
[ansible@elkhost01 ~]$ ls -l ~/.ssh
total 12
-rw-------. 1 ansible ansible 404 Oct 1 13:38 authorized_keys
-rw-------. 1 ansible ansible 1675 Oct 1 13:31 id_rsa
-rw-------. 1 ansible ansible 402 Oct 1 13:31 id_rsa.pub

Exchange public keys (copy each into the remote host's authorized_keys file) for passwordless access:

[ansible@ansible-host01 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub 10.68.64.117
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ansible@10.68.64.117's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh '10.68.64.117'"
and check to make sure that only the key(s) you wanted were added.

[ansible@ansible-host01 ~]$ 

[ansible@ansible-host01 ~]$ ssh 10.68.64.117
Last login: Thu Oct 1 13:38:35 2015 from 10.68.64.113
[ansible@elkhost01 ~]$

Once you have passwordless ssh configured between your hosts, go ahead and install Ansible on the orchestration host:

# yum install ansible -y

Once installed, there are a few post-install steps and tests to make sure that Ansible is working. First off, set up an Ansible hosts inventory file that will eventually contain all the hostnames, broken out by deployment type. The default location for this file is /etc/ansible/hosts. In this instance I have chosen a non-standard name/location in order to keep my hosts file within my proposed playbook directory.

[ansible@ansible-host01 elk]$ pwd
/home/ansible/elk
[ansible@ansible-host01 elk]$ cat production
# file: production

[elastic-hosts]
10.68.64.117

[kibana-hosts]
10.68.64.117

[nginx-hosts]
10.68.64.126

And if the passwordless ssh setup is correct, we can test as follows (using -i to point at the non-standard inventory file):

[ansible@ansible-host01 elk]$ ansible all -i production -m ping
10.68.64.117 | success >> {
 "changed": false,
 "ping": "pong"
}

Ansible machine configuration is done via playbooks, which are based on YAML syntax. There’s a great best-practices guide in the Ansible documentation, and I have followed its recommendations for the playbook directory layout below:

elk
├── elastic.yml
├── group_vars
├── host_vars
├── kibana.yml
├── production
├── roles
│   ├── common
│   │   ├── files
│   │   ├── handlers
│   │   ├── tasks
│   │   │   └── main.yml
│   │   ├── templates
│   │   └── vars
│   │       └── main.yml
│   ├── elastic
│   │   ├── files
│   │   │   └── elasticsearch.repo
│   │   ├── handlers
│   │   │   └── main.yml
│   │   ├── tasks
│   │   │   └── main.yml
│   │   ├── templates
│   │   │   ├── elasticsearch.default.j2
│   │   │   ├── elasticsearch.in.sh.j2
│   │   │   └── elasticsearch.yml.j2
│   │   └── vars
│   │       └── main.yml
│   └── kibana
│       ├── files
│       ├── handlers
│       │   └── main.yml
│       ├── tasks
│       │   └── main.yml
│       ├── templates
│       │   └── kibana4.service.j2
│       └── vars
│           └── main.yml
└── site.yml

I am going to cover the individual roles for Elasticsearch, Logstash and Kibana in subsequent posts. For now there’s a main site-wide playbook:

[ansible@ansible-host01 elk]$ cat site.yml
---
# file: site.yml
- include: elastic.yml
- include: kibana.yml
- include: logstash.yml
#- include: log-forwarder.yml
#- include: redis.yml
#- include: nginx.yml

Which is then broken up into individual service-specific playbooks:

[ansible@ansible-host01 elk]$ cat elastic.yml
---
# file: elastic.yml
- hosts: elastic-hosts
  roles:
    - common
    - elastic

[ansible@ansible-host01 elk]$ cat kibana.yml
---
# file: kibana.yml
- hosts: kibana-hosts
  roles:
    - kibana
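
With the inventory and playbooks in place, the whole stack can then be deployed in one shot; a sketch, assuming the production inventory file shown earlier and that all the included playbooks exist:

[ansible@ansible-host01 elk]$ ansible-playbook -i production site.yml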

I will discuss the individual roles and their associated tasks next time. For now, this should be enough to get basic Ansible functionality going.

Installing MongoDB on Nutanix XCP

As part of the recent MongoDB certification of Nutanix XCP as an Infrastructure as a Service (IaaS) platform, I thought I might collate some of the info I have collected while working to get the certification process completed. There are a lot of great docs over at www.mongodb.com, but I want to condense everything into a series of posts. This first post will deal with the initial install of a standalone MongoDB instance.

We saw in my previous post how to create a Linux VM and add networking and vDisks. In this instance I have added 6 x 200 GB vDisks for a data volume, plus two further vDisks: one for the journal volume (50 GB) and one to hold the log file (100 GB). Here’s the output from /usr/bin/lsscsi showing the disks and their SCSI assignments:

[2:0:1:0] disk NUTANIX VDISK 0 /dev/sdj
[2:0:2:0] disk NUTANIX VDISK 0 /dev/sdk
[2:0:7:0] disk NUTANIX VDISK 0 /dev/sdb
[2:0:8:0] disk NUTANIX VDISK 0 /dev/sdc
[2:0:9:0] disk NUTANIX VDISK 0 /dev/sdd
[2:0:10:0] disk NUTANIX VDISK 0 /dev/sde
[2:0:11:0] disk NUTANIX VDISK 0 /dev/sdf
[2:0:12:0] disk NUTANIX VDISK 0 /dev/sdg
[2:0:13:0] disk NUTANIX VDISK 0 /dev/sdh
[2:0:14:0] disk NUTANIX VDISK 0 /dev/sdi

Create a user/group mongod that will own the MongoDB software:

# groupadd mongod
# useradd -g mongod mongod

To install the MongoDB Enterprise packages, create a new repo file with the required information and then install using yum:

# pwd
/etc/yum.repos.d
# cat mongodb-enterprise.repo
[mongodb-enterprise]
name=MongoDB Enterprise Repository
baseurl=https://repo.mongodb.com/yum/redhat/$releasever/mongodb-enterprise/stable/$basearch/
gpgcheck=0
enabled=1
$ sudo yum install -y mongodb-enterprise

We use LVM to create a six-column striped data volume. All Nutanix vDisks are redundant (RF=2), so to create a RAID 10 style data volume we just stripe the vDisks, then create two further linear volumes for the journal and log. First create the underlying physical volumes:

# lsscsi | awk '{print $6}' | grep /dev/sd | grep -v sda | xargs pvcreate
 Physical volume "/dev/sdb" successfully created
 Physical volume "/dev/sdc" successfully created
 Physical volume "/dev/sdd" successfully created
 Physical volume "/dev/sde" successfully created
 Physical volume "/dev/sdf" successfully created
 Physical volume "/dev/sdg" successfully created
 Physical volume "/dev/sdh" successfully created
 Physical volume "/dev/sdi" successfully created
 Physical volume "/dev/sdj" successfully created

Then create both the volume groups and the required logical volumes:

vgcreate mongodata /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg 
vgcreate mongojournal /dev/sdh 
vgcreate mongolog /dev/sdi
# lvcreate -i 6 -l 100%VG -n mongodata mongodata
# lvcreate -l 100%VG -n mongojournal mongojournal 
# lvcreate -l 100%VG -n mongolog mongolog
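
To confirm the data volume really is striped across all six physical volumes, lvs can report the segment layout (a quick check, assuming the volume names above):

# the #Str column should show 6 stripes for the mongodata volume
lvs --segments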

Create an XFS filesystem on each volume:

mkfs.xfs /dev/mapper/mongodata-mongodata
mkfs.xfs /dev/mapper/mongojournal-mongojournal
mkfs.xfs /dev/mapper/mongolog-mongolog

Create the required mountpoints:

mkdir -p /mongodb/data /mongodb/journal /mongodb/log

Mount the filesystems, setting the noatime option on the data volume. The following entries go in /etc/fstab:

/dev/mapper/mongodata-mongodata /mongodb/data xfs defaults,auto,noatime,noexec 0 0
/dev/mapper/mongojournal-mongojournal /mongodb/journal xfs defaults,auto,noexec 0 0
/dev/mapper/mongolog-mongolog /mongodb/log xfs defaults,auto,noexec 0 0
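
With the fstab entries in place, mount everything and sanity-check the result (a minimal verification step):

# mount everything listed in /etc/fstab and confirm
mount -a
df -h | grep mongodb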

Set up a soft link to redirect the journal I/O to a separate volume:

# ln -s /mongodb/journal /mongodb/data/journal
...
lrwxrwxrwx. 1 root root 21 Nov 21 14:13 journal -> /mongodb/journal
...

At this point set the filesystem ownership to the MongoDB user:

# chown -R mongod:mongod /mongodb/data /mongodb/journal /mongodb/log

Prior to starting MongoDB, there are a few well-known best practices to adhere to. Firstly, we reduce the readahead on the data volume in order to avoid filling RAM with unwanted pages of data. MongoDB documents are quite small, and a large readahead figure will fill RAM with additional pages that then have to be evicted to make room for other required pages. Filling virtual memory with this superfluous data can have an adverse effect on performance. The usual recommendation is to start with a setting of 16 KB (32 x 512-byte sectors) and adjust upwards from there.

lrwxrwxrwx. 1 root root 7 Feb 4 11:50 /dev/mapper/mongodata-mongodata -> ../dm-3

# blockdev --setra 32 /dev/dm-3
# blockdev --getra /dev/dm-3
32
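
Note that blockdev settings do not survive a reboot. One simple way to make the change persistent (an assumption on my part that rc.local is acceptable here; a udev rule is an alternative) is:

# reapply the readahead setting at boot
echo 'blockdev --setra 32 /dev/mapper/mongodata-mongodata' >> /etc/rc.d/rc.local
chmod +x /etc/rc.d/rc.local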

MongoDB recommends that you disable transparent huge pages (THP); edit your startup files as follows:

# disable THP at boot time
if test -f /sys/kernel/mm/redhat_transparent_hugepage/enabled; then
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/redhat_transparent_hugepage/defrag; then
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
fi
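
You can verify that the setting has taken effect; the value in square brackets is the active one, and it should read never:

# check the current THP setting
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled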

Set swappiness = 1: MongoDB is a memory-based database, and if the nodes are sized correctly we won’t need to swap. However, setting swappiness=0 could cause unexpected invocations of the OOM (Out of Memory) killer on certain Linux distros.

$ sudo sysctl vm.swappiness=1                            # for the current runtime
$ echo 'vm.swappiness=1' | sudo tee -a /etc/sysctl.conf  # make it permanent

Disable NUMA, either in the VM BIOS or by invoking mongod with NUMA disabled. All supported versions of MongoDB ship with an init script that automates this as follows:

numactl --interleave=all /usr/bin/mongod -f /etc/mongod.conf

Also ensure that zone reclaim is disabled:

$ sudo cat /proc/sys/vm/zone_reclaim_mode
0
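
If this returns a non-zero value, it can be disabled at runtime (and persisted via /etc/sysctl.conf in the same way as swappiness above):

$ sudo sysctl vm.zone_reclaim_mode=0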

Finally, once you have configured the /etc/mongod.conf file (as root), you can start the mongod service. See the output from grep -v ^# /etc/mongod.conf below; note that I have added the address of the primary NIC interface to bind_ip, in addition to the local loopback.

logpath=/mongodb/log/mongod.log
logappend=true
fork=true
dbpath=/mongodb/data
pidfilepath=/var/run/mongodb/mongod.pid
bind_ip=127.0.0.1,10.68.64.110

Then start the service:

$ sudo service mongod start

Once the database has started, you can connect via the mongo shell and verify that the database is up and running:

$ mongo
MongoDB shell version: 3.0.3
connecting to: test
>

Now that we have our MongoDB instance installed, we can use it as a template to clone additional MongoDB hosts on demand. I will cover this in future posts when I create replica sets, shards and so on. For now, we need to get some data loaded, run a few CRUD operations and do some additional testing. I’ll cover this in my next post.