Configuring Docker Storage on Nutanix

I have recently been looking at how best to deploy a container ecosystem on my Nutanix XCP environment. At present I run a number of containers inside a virtual machine (VM), which gives me the required networking, persistent storage and the ability to migrate between hosts, something that containers themselves are only just becoming capable of in many cases. To scale out my container deployment within my Docker host VM, I need to consider increasing the space available under /var/lib/docker. By default, if you provide no additional storage/disk for your Docker install, loopback files are created to store your containers and images. This configuration is not particularly performant and is not supported for production. You can see below how the default setup looks…

# docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
 Pool Name: docker-253:1-33883287-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.821 GB
 Data Space Total: 107.4 GB
 Data Space Available: 50.2 GB
 Metadata Space Used: 1.479 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.146 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-client
ID: VHCA:JO3X:IRF5:44RG:CFZ6:WETN:YBJ2:6IL5:BNDT:FK32:KH6E:UZED

Configuring Docker Storage Options

Let's look at the various methods we can use to provide dedicated block storage for Docker containers. With the devicemapper storage driver, Docker automatically creates a base thin device on top of two block devices, one for data and one for metadata. The thin device is formatted with an empty filesystem on creation and acts as the base of all Docker images and containers: base images are snapshots of this device, and those images are in turn snapshotted for other images and eventually containers. This is the Docker-supported production setup. Also, by using LVM-based devices as the underlying storage you access them as raw devices and no longer go through the VFS layer.

Devicemapper : direct-lvm

To begin, create two LVM devices, one for container data and another to hold metadata. By default the loopback method creates a storage pool with 100GB of space. In this example I am creating a 200G vDisk for data and a 10G vDisk for metadata, from which I carve the data and metadata volumes. I prefer separate volumes where possible for performance reasons. We start by hot-adding the required Nutanix vDisks to the virtual machine (docker-directlvm) guest OS (Fedora 22):

<acropolis> vm.disk_create docker-directlvm create_size=200g container=DEFAULT-CTR
DiskCreate: complete
<acropolis> vm.disk_create docker-directlvm create_size=10g container=DEFAULT-CTR
DiskCreate: complete

[root@docker-directlvm ~]# lsscsi 
 
[2:0:1:0] disk NUTANIX VDISK 0 /dev/sdb 
[2:0:2:0] disk NUTANIX VDISK 0 /dev/sdc 

The next step is to create the individual LVM volumes

# pvcreate /dev/sdb /dev/sdc
# vgcreate direct-lvm /dev/sdb /dev/sdc

# lvcreate --wipesignatures y -n data direct-lvm -l 95%VG
# lvcreate --wipesignatures y -n metadata direct-lvm -l 5%VG

If setting up a new metadata pool, you need to zero the first 4K to indicate empty metadata:
# dd if=/dev/zero of=/dev/direct-lvm/metadata bs=1M count=1
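Before pointing Docker at the new devices, it's worth a quick sanity check that the volume group and both logical volumes exist and are sized as expected (names as per the commands above):

# vgs direct-lvm
# lvs direct-lvm
# ls -l /dev/direct-lvm/data /dev/direct-lvm/metadata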

For sizing the metadata volume above, the rule of thumb seems to be 0.1% of the data volume. This is somewhat anecdotal, so perhaps size with a little headroom. Next, start the Docker daemon with the required options set in the file /etc/sysconfig/docker-storage:

DOCKER_STORAGE_OPTIONS="--storage-opt dm.datadev=/dev/direct-lvm/data --storage-opt \
dm.metadatadev=/dev/direct-lvm/metadata --storage-opt dm.fs=xfs"
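These options only take effect when the daemon is restarted. On a systemd-based distro such as Fedora 22, something along these lines should apply them; note that switching storage backends effectively gives you an empty /var/lib/docker, so treat it as a fresh install:

# systemctl daemon-reload
# systemctl restart docker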

You can then verify that the requested underlying storage is in use, with the docker info command

# docker info
Containers: 5
Images: 2
Storage Driver: devicemapper
 Pool Name: docker-253:1-33883287-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/direct-lvm/data
 Metadata file: /dev/direct-lvm/metadata
 Data Space Used: 10.8 GB
 Data Space Total: 199.5 GB
 Data Space Available: 188.7 GB
 Metadata Space Used: 7.078 MB
 Metadata Space Total: 10.5 GB
 Metadata Space Available: 10.49 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-directlvm
ID: VHCA:JO3X:IRF5:44RG:CFZ6:WETN:YBJ2:6IL5:BNDT:FK32:KH6E:UZED

All well and good so far, but the storage options used above to point at the data and metadata devices, namely dm.datadev and dm.metadatadev, have been deprecated in favour of a preferred model: a thin pool reserved outside of Docker and passed to the daemon via the dm.thinpooldev storage option. Some Linux distros ship a helper script, docker-storage-setup (configured via /etc/sysconfig/docker-storage-setup), which does all the heavy lifting; you just need to supply a device and/or a volume group name.
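For reference, if your distro doesn't ship that helper, the thin pool can be carved out by hand with standard LVM commands. A minimal sketch, assuming a dedicated device /dev/sdd and a volume group called docker (sizes and chunk size are illustrative):

# pvcreate /dev/sdd
# vgcreate docker /dev/sdd
# lvcreate --wipesignatures y -n thinpool docker -l 95%VG
# lvcreate --wipesignatures y -n thinpoolmeta docker -l 1%VG
# lvconvert -y --zero n -c 512K --thinpool docker/thinpool --poolmetadata docker/thinpoolmeta

The resulting pool would then be passed to the daemon with --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool. The next section uses the helper script instead.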

Devicemapper: Thinpool

Once again, start by creating the virtual device – a Nutanix vDisk – and adding it to the virtual machine guest OS:

<acropolis> vm.disk_create docker-thinp create_size=200g container=DEFAULT-CTR
DiskCreate: complete

[root@localhost sysconfig]# lsscsi
[0:0:0:0] cd/dvd QEMU QEMU DVD-ROM 1.5. /dev/sr0
[2:0:0:0] disk NUTANIX VDISK 0 /dev/sda
[2:0:1:0] disk NUTANIX VDISK 0 /dev/sdd

Edit the file /etc/sysconfig/docker-storage-setup as follows:

root@localhost sysconfig]# cat /etc/sysconfig/docker-storage-setup
# Edit this file to override any configuration options specified in
# /usr/lib/docker-storage-setup/docker-storage-setup.
#
# For more details refer to "man docker-storage-setup"
DEVS=/dev/sdd
VG=docker

Then create the physical volume and volume group, and run the storage helper script:

[root@docker-thinp ~]# pvcreate /dev/sdd
 Physical volume "/dev/sdd" successfully created

[root@docker-thinp ~]# vgcreate docker /dev/sdd 
 Volume group "docker" successfully created 

[root@docker-thinp ~]# docker-storage-setup 
Rounding up size to full physical extent 192.00 MiB 
Logical volume "docker-poolmeta" created. 
Wiping xfs signature on /dev/docker/docker-pool. 
Logical volume "docker-pool" created. 
WARNING: Converting logical volume docker/docker-pool and docker/docker-poolmeta to pool's data and metadata volumes. 
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.) 
Converted docker/docker-pool to thin pool. 
Logical volume "docker-pool" changed.

You can then verify the underlying storage being used in the usual way

[root@docker-thinp ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT

sdd 8:48 0 186.3G 0 disk
├─docker-docker--pool_tmeta 253:5 0 192M 0 lvm
│ └─docker-docker--pool 253:7 0 74.4G 0 lvm
└─docker-docker--pool_tdata 253:6 0 74.4G 0 lvm
 └─docker-docker--pool 253:7 0 74.4G 0 lvm


[root@docker-thinp ~]# docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 62.39 MB
 Data Space Total: 79.92 GB
 Data Space Available: 79.86 GB
 Metadata Space Used: 90.11 kB
 Metadata Space Total: 201.3 MB
 Metadata Space Available: 201.2 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-thinp
ID: VHCA:JO3X:IRF5:44RG:CFZ6:WETN:YBJ2:6IL5:BNDT:FK32:KH6E:UZED

On completion this will have created the correct entries in /etc/sysconfig/docker-storage

[root@docker-thinp ~]# cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS=--storage-driver devicemapper --storage-opt dm.fs=xfs \
 --storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool

and runtime looks like...

[root@docker-thinp ~]# ps -ef | grep docker
root 8988 1 0 16:13 ? 00:00:11 /usr/bin/docker daemon --selinux-enabled
--storage-driver devicemapper --storage-opt dm.fs=xfs 
--storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool

Bear in mind that when you change the underlying Docker storage driver or storage options, as in the examples described above, the following destructive command sequence is typically run (be sure to back up any important data before running it):

$ sudo systemctl stop docker
$ sudo rm -rf /var/lib/docker 

Then, once the new storage configuration is in place:

$ systemctl daemon-reload
$ systemctl start docker
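Once the daemon is back up, a quick check confirms it really is using the new pool rather than silently falling back to loopback files:

# docker info | grep -E 'Pool Name|Data file|Metadata file|loop'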

Additional Info

http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/

https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/

http://docs.docker.com/engine/reference/commandline/daemon/#storage-driver-options

ELK on Nutanix : Kibana

It might seem like I am doing things out of sequence by looking at the visualisation layer of the ELK stack next. However, recall from my original post that I wanted to build sets of unreplicated indexes and then use Logstash to fire test workloads at them; hence I am covering Elasticsearch and Kibana first. This brings me to another technical point I need to cover. In order for a single set of indexes to actually be recoverable when running on a single node, we need to set the following parameters in our Elasticsearch playbook:

So in file: roles/elastic/vars/main.yml
...
elasticsearch_gateway_recover_after_nodes: 1
elasticsearch_gateway_recover_after_time: 5m
elasticsearch_gateway_expected_nodes: 1
...

These are then set in the elasticsearch.yml.j2 file as follows:

# file: roles/elastic/templates/elasticsearch.yml.j2
#{{ ansible_managed }}

...

# Allow recovery process after N nodes in a cluster are up:
#
#gateway.recover_after_nodes: 2
{% if elasticsearch_gateway_recover_after_nodes is defined %}
gateway.recover_after_nodes: {{ elasticsearch_gateway_recover_after_nodes }}
{% endif %}

and so on ....

This allows the indexes to be recovered when there is only a single node in the cluster. See below for the state of my indexes after a reboot:

[root@elkhost01 elasticsearch]# curl -XGET http://localhost:9200/_cluster/health?pretty
{
 "cluster_name" : "nx-elastic",
 "status" : "yellow",
 "timed_out" : false,
 "number_of_nodes" : 1,
 "number_of_data_nodes" : 1,
 "active_primary_shards" : 4,
 "active_shards" : 4,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 4,
 "delayed_unassigned_shards" : 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch" : 0
}

Let's now look at the Kibana playbook I am attempting. Unfortunately, Kibana is distributed as a compressed tar archive, which means the yum or dnf modules are no help here. There is, however, a very useful unarchive module, but first we need to download the tar bundle using get_url as follows:

- name: download kibana tar file
 get_url: url=https://download.elasticsearch.org/kibana/kibana/kibana-{{ kibana_version }}-linux-x64.tar.gz
 dest=/tmp/kibana-{{ kibana_version }}-linux-x64.tar.gz mode=755
 tags: kibana

I initially tried unarchiving the Kibana bundle into /tmp, intending to copy everything below the version-specific directory (/tmp/kibana-4.0.1-linux-x64) into the Ansible-created /opt/kibana directory. This proved problematic, as neither the synchronize nor the copy module seemed set up to do a mass copy/transfer from one directory structure to another. Maybe I am just not getting it – I even tried using with_items loops, but no joy, as fileglobs are not recursive. Answers on a postcard are always appreciated. In the end I just did this:

- name: create kibana directory
 become: true
 file: owner=kibana group=kibana path=/opt/kibana state=directory
 tags: kibana

- name: extract kibana tar file
 become: true
 unarchive: src=/tmp/kibana-{{ kibana_version }}-linux-x64.tar.gz dest=/opt/kibana copy=no
 tags: kibana

The next thing to do was to create a systemd service unit; there isn't one for Kibana as there is no RPM package available. The usual templating applies here:

- name: install kibana as systemd service
 become: true
 template: src=kibana4.service.j2 dest=/etc/systemd/system/kibana4.service owner=root \
           group=root mode=0644
 notify:
 - restart kibana
 tags: kibana

And the service unit file looked like:

[ansible@ansible-host01 templates]$ cat kibana4.service.j2
{{ ansible_managed }}

[Service]
ExecStart=/opt/kibana/kibana-{{ kibana_version }}-linux-x64/bin/kibana
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=kibana4
User=root
Group=root
Environment=NODE_ENV=production

[Install]
WantedBy=multi-user.target
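The playbook handler takes care of restarts during a run, but for a one-off manual check on the target VM the unit can be enabled and probed by hand (Kibana listens on its default port of 5601 here):

$ sudo systemctl daemon-reload
$ sudo systemctl enable kibana4.service
$ sudo systemctl start kibana4.service
$ sudo systemctl status kibana4.service
$ curl -I http://localhost:5601/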

This all seemed to work as I could now access Kibana via my browser. No indexes yet of course :

kibana_initial_install

There are one or two features I would still like to document. Firstly, the 'notify' actions in some of the plays: these are used to call – in my case – the restart handlers, which in turn restart the service in question. See below:

# file: roles/kibana/handlers

- name: restart kibana
 become: true
 service: name=kibana state=restarted

I wanted to document this next feature simply because it's so useful: tags. As you will have noticed, I have assigned a tag to every play/task in the playbook so far. For testing purposes they allow you to run specific plays, so you can troubleshoot just that particular play and see what's going on.

 ansible-playbook -i ./production site.yml --tags "kibana" --ask-sudo-pass

Now that I have the basic plays to get my Elasticsearch and Kibana services up and running via Ansible, it’s time to start looking at Logstash. Next time I post on ELK type stuff, I will try to look at logging and search use cases. Once I crack how they work of course.

ELK on Nutanix : Elasticsearch

In this second post on using Ansible to deploy the ELK stack on Nutanix, I will cover my initial draft at a playbook for Elasticsearch (ES).  Recall from my previous post, the playbook layout looks like:

[ansible@ansible-host01 roles]$ tree elastic
elastic
├── files
│   └── elasticsearch.repo
├── handlers
│   └── main.yml
├── tasks
│   └── main.yml
├── templates
│   ├── elasticsearch.default.j2
│   ├── elasticsearch.in.sh.j2
│   └── elasticsearch.yml.j2
└── vars
 └── main.yml

There's also an additional role at play here, config, which handles the basic configuration of the underlying VM guest OS and which we also need to look at:

[ansible@ansible-host01 roles]$ tree config
config
├── files
├── handlers
├── tasks
│   └── main.yml
├── templates
└── vars
 └── main.yml

This role is where I set things via the Ansible sysctl module, or add entries to files (using lineinfile) in order to set max memory, ulimits and so on. It's generic system configuration, so for example:

#installing java runtime pkgs (pre-req for ELK)
- name: install java 8 runtime
 become: true
 yum: name=java state=installed
 tags: config

#set system max/min numbers...
- name: set maximum map count in sysctl/systemd
 become: true
 sysctl: name=vm.max_map_count value={{ os_max_map_count }} state=present
 tags: config

...

- name: set soft limits for open files
 become: true
 lineinfile: dest=/etc/security/limits.conf line="{{ elasticsearch_user }} soft nofile {{ elasticsearch_max_open_files }}" insertafter=EOF backup=yes
 tags: config

- name: set max locked memory
 become: true
 lineinfile: dest=/etc/security/limits.conf line="{{ elasticsearch_user }} - memlock {{ elasticsearch_max_locked_memory }}" insertafter=EOF backup=yes
 tags: config

...
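Once this role has run, the results can be spot-checked on the target VM; the exact values depend on what is set in the role's vars/main.yml:

$ sysctl vm.max_map_count
$ grep elasticsearch /etc/security/limits.conf
$ cat /proc/$(pgrep -f elasticsearch | head -1)/limits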

This might be a good time to touch upon how Ansible allows you to set variables. Within each role's directory there's a subdir called vars, and all the variables needed for that role are contained in its main.yml. Here's a snippet:

# can use vars to set versioning and user 
elasticsearch_version: 1.7.0
elasticsearch_user: elasticsearch
...

# here's how we can specify the data volumes that ES will use 
elasticsearch_data_dir: /esdata/data01,/esdata/data02,/esdata/data03,/esdata/data04,/esdata/data05,/esdata/data06

...

# Virtual memory settings - ES heap is set to half my current VM RAM
# but no greater than 32GB for performance reasons
elasticsearch_heap_size: 16g
elasticsearch_max_locked_memory: unlimited
elasticsearch_memory_bootstrap_mlockall: "true"

....

# Good idea not to go with the ES default names of Franz Kafka etc
elasticsearch_cluster_name: nx-elastic
elasticsearch_node_name: nx-esnode01

# My initial nodes will be both cluster quorum members and data "workhorse" nodes.
# I will separate duties as I scale. Also I set the min master nodes to 1 so that
# my ES cluster comes up while initially testing a single index 
elasticsearch_node_master: "true"
elasticsearch_node_data: "true"
elasticsearch_discovery_zen_minimum_master_nodes: 1

We'll see how these variables are used as we cover more features. Next up, I used some nice features like the shell module and registered variables to provide conditional behaviour for the package install:

- name: check for previous elasticsearch installation
 shell: if [ -e /usr/share/elasticsearch/lib/elasticsearch-{{ elasticsearch_version }}.jar ]; then echo yes; else echo no; fi;
 register: version_exists
 always_run: True
 tags: elastic

- name: uninstalling previous version if applicable
 become: true
 command: yum erase -y elasticsearch
 when: version_exists.stdout == 'no'
 ignore_errors: true
 tags: elastic

and similarly for the marvel plugin :

- name: check marvel plugin installed
 become: true
 stat: path={{ elasticsearch_home_dir }}/plugins/marvel
 register: marvel_installed
 tags: elastic

- name: install marvel plugin
 become: true
 command: "{{ elasticsearch_home_dir }}/bin/plugin -i elasticsearch/marvel/latest"
 notify:
 - restart elasticsearch
 when: not marvel_installed.stat.exists
 tags: elastic

The Marvel plugin stanza above also makes use of the stat module – a really great module that returns all the goodness you would normally expect from a stat() system call, yet from within your Ansible playbook. There are a couple more things I will cover, and then I'll leave the rest for when I talk about Kibana and Logstash in a follow-up post. First up are templates. Ansible uses Jinja2 templating to transform a file and install it on your host; you create a file with the appropriate templating as below. The variables in {{ .. }} come from the role's vars directory, i.e. the main.yml file described earlier.

Note : I stripped all comment lines for sake of brevity:

[ansible@ansible-host01 templates]$ pwd
/home/ansible/elk/roles/elastic/templates
[ansible@ansible-host01 templates]$ grep -v ^# elasticsearch.yml.j2
{% if elasticsearch_cluster_name is defined %}
cluster.name: {{ elasticsearch_cluster_name }}
{% endif %}

...

{% if elasticsearch_node_name is defined %}
node.name: {{ elasticsearch_node_name }}
{% endif %}

...

{% if elasticsearch_node_master is defined %}
node.master: {{ elasticsearch_node_master }}
{% endif %}
{% if elasticsearch_node_data is defined %}
node.data: {{ elasticsearch_node_data }}
{% endif %}

...

{% if elasticsearch_memory_bootstrap_mlockall is defined %}
bootstrap.mlockall: {{ elasticsearch_memory_bootstrap_mlockall }}
{% endif %}
....

When run in the play, the template file is transformed using the provided variables and copied into place on my ELK target VM…

- name: copy elasticsearch defaults file
 become: true
 template: src=elasticsearch.default.j2 dest=/etc/sysconfig/elasticsearch owner={{ elasticsearch_user }} group={{ elasticsearch_group }} mode=0644
 notify:
 - restart elasticsearch
 tags: elastic

So let’s see how our playbook runs and what the output looks like

[ansible@ansible-host01 elk]$ ansible-playbook -i ./production site.yml \
--tags "config,elastic" --ask-sudo-pass
SUDO password:

PLAY [elastic-hosts] **********************************************************

GATHERING FACTS ***************************************************************
ok: [10.68.64.117]

TASK: [config | install java 8 runtime] ***************************************
ok: [10.68.64.117]

TASK: [config | set swappiness in sysctl/systemd] *****************************
ok: [10.68.64.117]

TASK: [config | set maximum map count in sysctl/systemd] **********************
ok: [10.68.64.117]

TASK: [config | set hard limits for open files] *******************************
ok: [10.68.64.117]

TASK: [config | set soft limits for open files] *******************************
ok: [10.68.64.117]

TASK: [config | set max locked memory] ****************************************
ok: [10.68.64.117]

TASK: [config | Install wget package (Fedora based)] **************************
ok: [10.68.64.117]

TASK: [elastic | install elasticsearch signing key] ***************************
changed: [10.68.64.117]

TASK: [elastic | copy elasticsearch repo] *************************************
ok: [10.68.64.117]

TASK: [elastic | check for previous elasticsearch installation] ***************
changed: [10.68.64.117]

TASK: [elastic | uninstalling previous version if applicable] *****************
skipping: [10.68.64.117]

TASK: [elastic | install elasticsearch pkgs] **********************************
skipping: [10.68.64.117]

TASK: [elastic | copy elasticsearch configuration file] ***********************
ok: [10.68.64.117]

TASK: [elastic | copy elasticsearch defaults file] ****************************
ok: [10.68.64.117]

TASK: [elastic | set max memory limit in systemd file (RHEL/CentOS 7+)] *******
changed: [10.68.64.117]

TASK: [elastic | set log directory permissions] *******************************
ok: [10.68.64.117]

TASK: [elastic | set data directory permissions] ******************************
ok: [10.68.64.117]

TASK: [elastic | ensure elasticsearch running and enabled] ********************
ok: [10.68.64.117]

TASK: [elastic | check marvel plugin installed] *******************************
ok: [10.68.64.117]

TASK: [elastic | install marvel plugin] ***************************************
skipping: [10.68.64.117]

NOTIFIED: [elastic | restart elasticsearch] ***********************************
changed: [10.68.64.117]

PLAY RECAP ********************************************************************
10.68.64.117 : ok=19 changed=4 unreachable=0 failed=0

[ansible@ansible-host01 elk]$

I can verify that my ES cluster is working by querying the Cluster API – note that the red status is down to the fact I have no other cluster nodes yet on which to replicate the index shards:

# curl -XGET http://localhost:9200/_cluster/health?pretty
{
 "cluster_name" : "nx-elastic",
 "status" : "red",
 "timed_out" : false,
 "number_of_nodes" : 1,
 "number_of_data_nodes" : 1,
 "active_primary_shards" : 0,
 "active_shards" : 0,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 0,
 "delayed_unassigned_shards" : 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch" : 0
}

You can use further API queries to verify that the desired configuration is in place, and at that point you have a solid, repeatable deployment with a known outcome, i.e. you are doing DevOps.
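A few example queries of that kind, assuming Elasticsearch is listening on its default port of 9200:

# curl -XGET 'http://localhost:9200/_nodes/settings?pretty'
# curl -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty'
# curl -XGET 'http://localhost:9200/_cat/nodes?v'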

Using Ansible to deploy ELK stack on Nutanix

Just recently my colleague Andrew Nelson (@vmwnelson) posted an article on setting up Ansible on the Nutanix platform. I am also using Ansible to develop playbooks and the like to deploy the ELK stack components (Elasticsearch-Logstash-Kibana) on a block here at Nutanix. My initial aim is to setup a single index in an Elasticsearch (single node for now) cluster and use Logstash to pipe in data to be indexed. On top of that I intend to use Kibana and the Marvel plugin to measure at which point my index begins to struggle (based on stuff like OS level resource consumption, etc) as viewed from Marvel.

From a virtual machine perspective I have a Fedora 22 based gold image. From this base image I clone one VM to be the Ansible master that I will run playbooks (orchestration) from, and another VM to deploy my ELK stack to. This second "target" VM has had seven vDisks added to it. The idea here is that Elasticsearch (ES) can use a comma-separated list of data directories (in my case backed by six linear LVM volumes, with a seventh vDisk for the ES log volume). These are written to in a round-robin fashion by ES, so the data gets "striped". Nutanix vDisks are already redundant, so we are getting a kind of RAID 10 for free! Here's how my disk layout looks once configured and mounted (I am using XFS as my filesystem) on the target VM:

[root@elkhost01 ~]# df -h
/dev/mapper/esdata05-esdata05 200G 271M 200G 1% /esdata/data05
/dev/mapper/esdata03-esdata03 200G 291M 200G 1% /esdata/data03
/dev/mapper/esdata04-esdata04 200G 273M 200G 1% /esdata/data04
/dev/mapper/esdata02-esdata02 200G 271M 200G 1% /esdata/data02
/dev/mapper/esdata06-esdata06 200G 291M 200G 1% /esdata/data06
/dev/mapper/eslog-eslog 100G 150M 100G 1% /var/log/elasticsearch
/dev/mapper/esdata01-esdata01 200G 279M 200G 1% /esdata/data01

and

[root@elkhost01 ~]# lvs
 LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
 esdata01 esdata01 -wi-ao---- 200.00g
 esdata02 esdata02 -wi-ao---- 200.00g
 esdata03 esdata03 -wi-ao---- 200.00g
 esdata04 esdata04 -wi-ao---- 200.00g
 esdata05 esdata05 -wi-ao---- 200.00g
 esdata06 esdata06 -wi-ao---- 200.00g
 eslog eslog -wi-ao---- 100.00g
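For reference, each data volume is a simple linear LVM volume on its own Nutanix vDisk with XFS on top. A minimal sketch of how one of them could be built, assuming /dev/sdb is the vDisk backing esdata01:

# pvcreate /dev/sdb
# vgcreate esdata01 /dev/sdb
# lvcreate -n esdata01 -l 100%VG esdata01
# mkfs.xfs /dev/esdata01/esdata01
# mkdir -p /esdata/data01
# mount /dev/esdata01/esdata01 /esdata/data01
# echo '/dev/esdata01/esdata01 /esdata/data01 xfs defaults 0 0' >> /etc/fstab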

The next step is to install and configure Ansible. First off, configure an ansible user on both the orchestration host and the target host and sync ssh keys between the two (there's a module that does ssh key exchange in Ansible, which I will cover at some stage), like so:

on both VMs :

useradd ansible
passwd ansible

# generate pub and priv keys ....
ssh-keygen -t rsa

If using strictmodes (default) in sshd_config file 
then ensure correct perms on .ssh directory and files 

chmod 700 ~/.ssh 
chmod 600 ~/.ssh/authorized_keys
[ansible@elkhost01 ~]$ ls -l ~/.ssh
total 12
-rw-------. 1 ansible ansible 404 Oct 1 13:38 authorized_keys
-rw-------. 1 ansible ansible 1675 Oct 1 13:31 id_rsa
-rw-------. 1 ansible ansible 402 Oct 1 13:31 id_rsa.pub

Exchange public keys (copy into remote hosts authorized_keys file) 
for passwordless access

[ansible@ansible-host01 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub 10.68.64.117
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ansible@10.68.64.126's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh '10.68.64.117'"
and check to make sure that only the key(s) you wanted were added.

[ansible@ansible-host01 ~]$ 

[ansible@ansible-host01 ~]$ ssh 10.68.64.117
Last login: Thu Oct 1 13:38:35 2015 from 10.68.64.113
[ansible@elkhost01 ~]$

Once you have passwordless ssh configured between your hosts – go ahead and install Ansible on the orchestration host:

# yum install ansible -y

Once installed, there are a few post-install steps and tests to make sure that Ansible is working. First off, set up an Ansible hosts inventory file that will eventually contain all the hostnames broken out by deployment type. The default location for this file is /etc/ansible/hosts; in this instance I have chosen a non-standard name/location in order to keep my hosts file within my proposed playbook.

[ansible@ansible-host01 elk]$ pwd
/home/ansible/elk
[ansible@ansible-host01 elk]$ cat production
# file: production

[elastic-hosts]
10.68.64.117

[kibana-hosts]
10.68.64.117

[nginx-hosts]
10.68.64.126

And if the passwordless ssh setup is correct – we can test as follows :

[ansible@ansible-host01 elk]$ ansible all -i ./production -m ping
10.68.64.117 | success >> {
 "changed": false,
 "ping": "pong"
}

Ansible machine configuration is done via playbooks, which are based on YAML syntax. There’s a great best practice guide here. I have followed that same best practice guide on playbook directory layout below …

elk
├── elastic.yml
├── group_vars
├── host_vars
├── kibana.yml
├── production 
├── roles
│   ├── common
│   │   ├── files
│   │   ├── handlers
│   │   ├── tasks
│   │   │   └── main.yml
│   │   ├── templates
│   │   └── vars
│   │   └── main.yml
│   ├── elastic
│   │   ├── files
│   │   │   └── elasticsearch.repo
│   │   ├── handlers
│   │   │   └── main.yml
│   │   ├── tasks
│   │   │   └── main.yml
│   │   ├── templates
│   │   │   ├── elasticsearch.default.j2
│   │   │   ├── elasticsearch.in.sh.j2
│   │   │   └── elasticsearch.yml.j2
│   │   └── vars
│   │   └── main.yml
│  └── kibana
│      ├── files
│      ├── handlers
│      │   └── main.yml
│      ├── tasks
│      │   └── main.yml
│      ├── templates
│      │   └── kibana4.service.j2
│      └── vars
│      └── main.yml
├── site.yml

I am going to cover the individual roles for elasticsearch, logstash and kibana in subsequent posts. For now there’s a main site wide playbook :

[ansible@ansible-host01 elk]$ cat site.yml
---
# file: site.yml
- include: elastic.yml
- include: kibana.yml
- include: logstash.yml
#- include: log-forwarder.yml
#- include: redis.yml
#- include: nginx.yml

Which is then broken up into individual service specific playbooks :

[ansible@ansible-host01 elk]$ cat elastic.yml
---
#file: elastic.yml
- hosts: elastic-hosts
 roles:
 - common
 - elastic
[ansible@ansible-host01 elk]$ cat kibana.yml
---
#file: kibana.yml
- hosts: kibana-hosts
 roles:
 - kibana

I will discuss the individual roles and their associated tasks etc next time. For now this should be enough to get basic Ansible functionality going.

Sharded MongoDB config in Nutanix (3) : Backup & DR

Backing up sharded NoSQL databases can often require some additional consideration. For example, any backup of a sharded MongoDB config needs to capture a backup for each shard and a single member of the configuration database quorum. The configuration database (configdb) holds the cluster metadata and so supports the ability to shard. In a production environment you will need three config databases, and they will all contain the same (meta)data. In this post I intend to cover the steps I recently used to back up a sharded MongoDB deployment using the snapshot technology available on my Nutanix platform.

The first step prior to any backup should always be to stop the balancer, which is responsible for migrating/balancing data "chunks" between the various shards. If such a migration is running while you back up, the resultant backup is potentially invalidated.

mongos> use config
switched to db config
mongos> sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
Waiting again for active hosts after balancer is off...
mongos>
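Before taking any snapshots it's worth confirming that the balancer really is disabled and that no migration is still in flight; from a shell on the mongos host, something like:

$ mongo --port 27017 --eval 'printjson(sh.getBalancerState()); printjson(sh.isBalancerRunning())'

Both should return false before you proceed.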

At this point we can proceed to lock one of the secondary replicas in each shard. I outlined how to do this in my post on backing up replica sets. The command sequence is repeated below; note that this needs to be done on one secondary for each shard (and should only be done if running the MMAPv1 storage engine on the replica):

rs01:SECONDARY> db.fsyncLock()
{
 "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
 "seeAlso" : "http://dochub.mongodb.org/core/fsynccommand",
 "ok" : 1

Having locked the secondaries against writes, the next step is to create a virtual machine (VM) snapshot of a configdb and of a secondary belonging to each shard (replica set), using the Nutanix Acropolis App Mobility Fabric as follows:

<acropolis> vm.snapshot_create mongo-configdb01,mongodb03,mongowt03 snapshot_name_list=mongoconfigdb01-bk,mongodb03-bk,mongowt03-bk
SnapshotCreate: complete

The above snapshots have all been created at once within a single consistency group. The next step will be to create clones from them…

<acropolis> vm.clone configdb01-clone clone_from_snapshot=mongoconfigdb01-bk
configdb01-clone: complete
<acropolis> vm.clone mongodb03-clone clone_from_snapshot=mongodb03-bk
mongodb03-clone: complete
<acropolis> vm.clone mongowt03-clone clone_from_snapshot=mongowt03-bk
mongowt03-clone: complete

At this point we can unlock each of the secondaries :

rs01:SECONDARY> db.fsyncUnlock()
{ "ok" : 1, "info" : "unlock completed" }
rs01:SECONDARY>

and re-enable the balancer:

mongos> use config
switched to db config
mongos> sh.setBalancerState(true)
mongos>

As of now I merely have the “bare bones” of a MongoDB cluster encapsulated in the three VM clones just created. The thing to bear in mind is that each clone generated from the replica snapshots contains only a subset of any sharded collection. Hopefully, ~50% each, if our shard key selection is any good! That means we can’t just proceed as in previous posts and bring up each clone as a standalone MongoDB instance. The simplest way to make use of the current clones might be to just rsync any data to new hosts in a freshly sharded deployment. So essentially, we would just transfer the data to the required volumes on the newly set up VMs. In any case, there would still be some work to do around the replica set memberships and associated config.
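A rough sketch of that approach, with mongod stopped on both ends; /data/db is an assumed dbpath and new-shard-host an assumed target, so substitute whatever your own mongod.conf and inventory actually use:

$ sudo service mongod stop
$ rsync -av /data/db/ mongod@new-shard-host:/data/db/
$ ssh mongod@new-shard-host 'sudo chown -R mongod:mongod /data/db'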

Alternatively, to have access to any sharded collection held in my newly created clones, I could begin by reconfiguring each replica clone as the new primary of its replica set and creating additional configdb VMs that can be registered with a new mongos VM. Recall that mongos is stateless and gets its info from the configdbs, so at that stage we can re-register the replica shards within the configdb service. For example, here's the state of the replica sets after they have been cloned:

> rs.status()
{
 "state" : 10,
 "stateStr" : "REMOVED",
 "uptime" : 97,
 "optime" : Timestamp(1443441939, 1),
 "optimeDate" : ISODate("2015-09-28T12:05:39Z"),
 "ok" : 0,
 "errmsg" : "Our replica set config is invalid or we are not a member of it",
 "code" : 93
}
> rs.conf()
{
 "_id" : "rs01",
 "version" : 7,
 "members" : [
 {
 "_id" : 0,
 "host" : "10.68.64.111:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {

 },
 "slaveDelay" : 0,
 "votes" : 1
 },
 {
 "_id" : 1,
 "host" : "10.68.64.131:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {

 },
 "slaveDelay" : 0,
 "votes" : 1
 },
 {
 "_id" : 2,
 "host" : "10.68.64.144:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {

 },
 "slaveDelay" : 0,
 "votes" : 1
 }
 ],
 "settings" : {
 "chainingAllowed" : true,
 "heartbeatTimeoutSecs" : 10,
 "getLastErrorModes" : {

 },
 "getLastErrorDefaults" : {
 "w" : 1,
 "wtimeout" : 0
 }
 }
}

So first off we need to set each cloned replica VM as the new replica set primary and remove the no longer required (or available) hosts from the set membership :

> cfg=rs.conf()
> printjson(cfg) 
> cfg.members = [cfg.members[0]]
[
 {
 "_id" : 0,
 "host" : "10.68.64.111:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {
 },
 "slaveDelay" : 0,
 "votes" : 1
 }
]
 
> cfg.members[0].host="10.68.64.153:27017"
10.68.64.153:27017

> rs.reconfig(cfg, {force : true})
{ "ok" : 1 }

rs01:PRIMARY> rs.status()
{
 "set" : "rs01",
 "date" : ISODate("2015-10-06T14:02:23.263Z"),
 "myState" : 1,
 "members" : [
 {
 "_id" : 0,
 "name" : "10.68.64.152:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "uptime" : 396,
 "optime" : Timestamp(1443441939, 1),
 "optimeDate" : ISODate("2015-09-28T12:05:39Z"),
 "electionTime" : Timestamp(1444140137, 1),
 "electionDate" : ISODate("2015-10-06T14:02:17Z"),
 "configVersion" : 97194,
 "self" : true
 }
 ],
 "ok" : 1
}

Once you have done this for all the required replica sets (these are your shards, don't forget), the next step is to set up the configdb clone and create additional identical VMs that will contain the cluster metadata. The configdbs can be verified for correctness as follows:

configsvr> db.runCommand("dbhash")
{
 "numCollections" : 14,
 "host" : "localhost.localdomain:27019",
 "collections" : {
 "actionlog" : "bd8d8c2425e669fbc55114af1fa4df97",
 "changelog" : "fcb8ee4ce763a620ac93c5e6b7562eda",
 "chunks" : "bd7a2c0f62805fa176c6668f12999277",
 "collections" : "f8b0074495fc68b64c385bf444e4cc90",
 "databases" : "c9ee555dde6fc84a7bbdb64b74ef19bd",
 "lockpings" : "ba67ca64d12fd36f8b35a54e167649a8",
 "locks" : "c226b1a2601cf3e61ba45aeab146663d",
 "mongos" : "690326c2edcb410eeeb9212ad7c6c269",
 "settings" : "ce32ef7c2b99ca137c5a20ea477062f7",
 "shards" : "77d49755ba04fe38639c5c18ee5be78d",
 "tags" : "d41d8cd98f00b204e9800998ecf8427e",
 "version" : "14e1d35ba0d32a5ff393ddc7f16125a1"
 },
 "md5" : "61bde8ac240aead03080f4dde3ec2932",
 "timeMillis" : 43,
 "fromCache" : [ ],
 "ok" : 1
}

The above collection hashes need to agree across the configdb membership; they are key to having all configdb servers in agreement. Once you have the configdbs enabled, register them with a newly created mongos VM. Below, I am just using a single configdb to test for correctness; a production setup should always have three per cluster:

 mongos --configdb 10.68.64.151:27019

The next issue is to correct the configdb shard info. As you can see from the mongos session below, the replica info in the configdb still refers to the previous deployment:

mongos> db.adminCommand( { listShards: 1 } )
{
 "shards" : [
 {
 "_id" : "rs01",
 "host" : "rs01/10.68.64.111:27017,10.68.64.131:27017,10.68.64.144:27017"
 },
 {
 "_id" : "rs02",
 "host" : "rs02/10.68.64.110:27017,10.68.64.114:27017,10.68.64.137:27017"
 }
 ],
 "ok" : 1
}

We can correct the above setup to reflect our newly cloned shard/replica VMs, in a mongo shell session on the configdb server VM:

use config
configsvr> db.shards.update({_id: "rs01"} , {$set: {"host" : "10.68.64.152:27017"}})
configsvr> db.shards.update({_id: "rs02"} , {$set: {"host" : "10.68.64.153:27017"}})

You will have to restart the mongos server so that it picks up the new info from the configdb server.
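With no service unit for mongos in this test setup, restarting it is just a case of killing the process and starting it again against the same configdb; a sketch (the log path is illustrative):

# pkill mongos
# mongos --configdb 10.68.64.151:27019 --fork --logpath /var/log/mongos.log

After the restart, the shard list reflects the new hosts: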

mongos> db.adminCommand( { listShards: 1 } )
{
 "shards" : [
 {
 "_id" : "rs01",
 "host" : "10.68.64.152:27017"
 },
 {
 "_id" : "rs02",
 "host" : "10.68.64.153:27017"
 }
 ],
 "ok" : 1

And that, as they say, is how babies get made. At this stage you have a MongoDB cluster consisting of a configdb, registered with a mongos server, that can access both shards, each formed of a replica set with a single primary member. To flesh this out to production standards you could increase the configdb count (to three) and add secondaries to the replica sets for higher availability. With some additional work (renaming replica sets, perhaps?) this could form the basis of a dev/QA system containing a potential production workload.

Sharded MongoDB config on Nutanix (2) : High Availability

One of the prime availability considerations for any horizontal scale-out application, like a MongoDB cluster, is how that cluster behaves under a failure event. We have seen (in the MongoDB case) how replica sets are configured with additional secondary instances to handle the failure of a primary instance in a replica set. We also create a mini quorum of configuration database servers and query routers to give redundancy to the cluster "infrastructure". However, the Nutanix XCP environment provides further protection through certain features of the Acropolis management interface. Your key VMs need to be enabled for high availability, so that when the underlying hypervisor host fails for any reason, these VMs fail over to another host with sufficient CPU and RAM resources. The screenshot below shows how this (Tech Preview) feature can be enabled (pre NOS 4.5) on a per-VM basis:

enable-HA

The underlying migration functionality is also used for the manual placement of key VMs. As an example, let’s consider the following layout, where two of the configdb VMs in a MongoDB cluster are co-located on the same AHV host:

mongodb-colocated-vms

Notice in the screen capture above, there are two configdb VMs on host “D”. This means that ideally we want to migrate a MongoDB Config DB to another AHV host. Let’s move the VM mongo-configdb02 to AHV host “C”…

mongodb-migrate-VM

Note that the migration process could have automatically chosen an appropriate AHV host to receive the VM. In the above case however, we have instead specified the desired host ourselves.

We can monitor the progress and duration of any migration via the VM tasks frame in Prism:

mongodb-vm-tasks-migration

As always, this workflow can also be done manually (or scripted) through the acli interface. In this example I am migrating the VM running a query router (mongos process)….

<acropolis> vm.migrate mongos01 host=10.68.64.41 
mongos01: complete 
<acropolis>

As of the time of writing this post, Acropolis Base Software (NOS) 4.5 has been released and this feature has become generally available (GA). It can now be enabled cluster-wide:

ha-enable-menu-4

enable-ha-4

Nutanix customers are strongly recommended to enable this feature when they require HA functionality for their VMs.

In my next post I will be completing this short blog series on sharded MongoDB configs on Nutanix. I intend to cover how Nutanix Acropolis managed snapshots and cloning are employed to create backups and then use them to perform rapid build out of potential dev/QA type environments. Stay tuned.


Sharded MongoDB config on Nutanix (1) : Deployment

So far I have posted on MongoDB deployments either as standalone or as part of a replica set. This is fine when you can size your VM memory to hold the entire database working set. However, if your VM’s RAM will not accommodate the working set in memory, you will need to shard to aggregate RAM from multiple replica sets and form a MongoDB cluster.

Having already discussed using clones of gold image VMs to create members for a replica set, the most basic of MongoDB clusters requires at least two such replica sets. On top of these we need a number of MongoDB "infrastructure" VMs that make cluster operation possible: a minimum of three (3) configuration databases (mongod --configsvr) per cluster and around one (1) query router (mongos) for every two shards. Here is the layout of a cluster deployment on my lab system:

2shard-system

In the above lab deployment, for availability reasons, I avoid co-locating any primary replica VMs on the same physical host, and likewise any of the query router or configdb VMs. One thing to bear in mind is that sharding is done on a per-collection basis. Simply put, the idea behind sharding is that you split a collection across the replica sets, and by connecting to a mongos process you are routed to the appropriate shard holding the part of the collection that can serve your query. The following commands show the syntax to create one of the three required configdbs (run on three separate VMs, which need to be started first) and a query router, or mongos process (where we add the IP addresses of each configdb server VM):

Config DB servers – each run as:
mongod --configsvr --dbpath /data/configdb --port 27019

Query router - run as:
mongos --configdb 10.68.64.142:27019,10.68.64.143:27019,10.68.64.145:27019

- the above IP addresses in mongos command line are the addresses of each config DB.

This brings up an issue if you are not cloning replica VMs from "blank" gold VMs. If you clone a new replica set from a currently working replica set, you essentially end up with each replica set holding a full copy of all your databases and their collections. Then, when you come to add such a replica set as a shard, you generate the error condition shown below.

Here's an example of what can happen when you attempt to shard and your new replica set (rs02) is simply cloned off a currently running replica set (rs01):

mongos> sh.addShard("rs02/192.168.1.52")
{
 "ok" : 0,
 "errmsg" : "can't add shard rs02/192.168.1.52:27017 because a local database 'ycsb' 
exists in another rs01:rs01/192.168.1.27:27017,192.168.1.32:27017,192.168.1.65:27017"
}

This is the successful workflow adding both shards (the primary of each replica set) via the mongos router VM:

$ mongo --host localhost --port 27017
MongoDB shell version: 3.0.3
connecting to: localhost:27017/test
mongos>
 
mongos> sh.addShard("rs01/10.68.64.111")
{ "shardAdded" : "rs01", "ok" : 1 }
mongos> sh.addShard("rs02/10.68.64.110")
{ "shardAdded" : "rs02", "ok" : 1 }

We next need to enable sharding on the database and subsequently shard on the collection we want to distribute across the replica sets available. The choice of shard key is crucial to future MongoDB cluster performance. Issues such as read and write scaling, cardinality etc are covered here. For my test cluster I am using the _id field for demonstration purposes.

mongos> sh.enableSharding("ycsb")
{ "ok" : 1 }

mongos> sh.shardCollection("ycsb.usertable", { "_id": 1})
{ "collectionsharded" : "ycsb.usertable", "ok" : 1 }

The balancer process will run for however long it takes to migrate data between the available shards. This can take anywhere from hours to days depending on the size of the collection, the number of shards, the current workload and so on. Once complete, it results in the following sharding status output. Notice that the "chunks" of the usertable collection held in the ycsb database are now spread across both shards (522 chunks in each shard):

 mongos> sh.status()
--- Sharding Status ---
 sharding version: {
 "_id" : 1,
 "minCompatibleVersion" : 5,
 "currentVersion" : 6,
 "clusterId" : ObjectId("55f96e6c5dfc4a5c6490bea3")
}
 shards:
 { "_id" : "rs01", "host" : "rs01/10.68.64.111:27017,10.68.64.131:27017,10.68.64.144:27017" }
 { "_id" : "rs02", "host" : "rs02/10.68.64.110:27017,10.68.64.114:27017,10.68.64.137:27017" }
 balancer:
 Currently enabled: yes
 Currently running: no
 Failed balancer rounds in last 5 attempts: 0
 Migration Results for the last 24 hours:
 No recent migrations
 databases:
 { "_id" : "admin", "partitioned" : false, "primary" : "config" }
 { "_id" : "enron_mail", "partitioned" : false, "primary" : "rs01" }
 { "_id" : "mydocs", "partitioned" : false, "primary" : "rs01" }
 { "_id" : "sbtest", "partitioned" : false, "primary" : "rs01" }
 { "_id" : "ycsb", "partitioned" : true, "primary" : "rs01" }
 ycsb.usertable
 shard key: { "_id" : 1 }
 chunks:
 rs01 522
 rs02 522
 too many chunks to print, use verbose if you want to force print
 { "_id" : "test", "partitioned" : false, "primary" : "rs02" }

Using Nutanix snapshots to backup MongoDB

In an earlier post I described how Nutanix clones can speed up deployment workflows for MongoDB replica sets. In this post we will cover the Nutanix VM-centric snapshot functionality, used to back up a MongoDB database instance. Even when the database is sharded, the ability to take a snapshot of a set of VMs at once (as part of a consistency group) ensures that a point-in-time (PIT) backup can easily be taken.

Let's first consider the three-member replica set described in my last post. In order to take an OS-level backup we need to quiesce the I/O on one of the secondary members of the replica set. The command sequence below is for the MMAPv1 storage engine only. db.fsyncLock() flushes all pending writes and locks the instance against further writes until we run db.fsyncUnlock(). Even though we have journaling enabled, we have separated the I/O for data and journal files onto different volumes (as part of our standard config), hence the need to quiesce I/O to make sure we take a consistent snapshot.

rs01:SECONDARY> db.fsyncLock()
{
 "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
 "seeAlso" : "http://dochub.mongodb.org/core/fsynccommand",
 "ok" : 1
}

Nutanix snapshots work at the VM level. This means we take a snapshot of all the vDisks in the VM hosting a MongoDB secondary at the same time. However, MongoDB can’t guarantee backup consistency if the journal and data are located in separate volumes. Hence, we still need db.fsyncLock()/db.fsyncUnlock().

<acropolis> vm.snapshot_create mongodb03
SnapshotCreate: complete

<acropolis> vm.snapshot_create mongodb03 snapshot_name_list=mongodb03-snap1
SnapshotCreate: complete

<acropolis> vm.snapshot_list mongodb03
Snapshot name Snapshot UUID
mongodb03-snap1 41c77ddc-8cc7-49bf-a250-23d52031b76e
mongodb03_2015-09-08T10:04:00.805780 7cb77f7f-1eb6-4103-8909-e8e50c318151

Above we have taken two snapshots (by way of example) and show the naming conventions returned for both a named snapshot and the default. See the rest of the workflow below…

<acropolis> vm.clone mongodb04 clone_from_snapshot=mongodb03-snap1
mongodb03-backup: complete

<acropolis> vm.restore mongodb03 mongodb03-snap1
VmRestore: complete

<acropolis> snapshot.delete mongodb03-snap1
Delete 1 snapshots? (yes/no) y
Please type 'yes' or 'no': yes
mongodb03-snap1: complete

We can carry out the same workflow via Nutanix Prism. I have taken a screenshot of the available commands (see below) in the VM Snapshots frame. The command sequence is issued via the VM tab in the GUI, and the desired arguments, such as clone and snapshot names, are entered in the resulting popups. For reasons of brevity I have not included all the required screenshots.

vm-snapshot-actions

Having taken our snapshot, we can now unlock the replica – make sure you have kept the session where you called the db.fsyncLock() command open until this time.

rs01:SECONDARY> db.fsyncUnlock()
{ "ok" : 1, "info" : "unlock completed" }
rs01:SECONDARY>

Any newly created clones from the Nutanix VM snapshot can be used to seed a test environment with production strength data. Bearing in mind that the backup/clone was made from a VM that formed part of a replica set, we would like to be able to start the MongoDB instance in standalone mode. You can do this as follows ….

  • power on the newly created clone (mongodb04)
  • Login and create a sub-directory in the MongoDB data directory in order to save the local db files to…
[root@mongodb04 data]# mkdir local-old
[root@mongodb04 data]# mv local.* local-old/
  • edit /etc/mongod.conf : comment out the line naming the replica set and make sure the bind_ip reflects the IP address assigned to the VM (via DHCP in my case)
[mongod@mongodb04 ~]$ ip a | grep eth0
2: eth0: <broadcast,multicast,up,lower_up> mtu 1500 qdisc pfifo_fast state UP qlen 1000
 inet 10.68.64.138/24 brd 10.68.64.255 scope global eth0

[mongod@mongodb04 ~]$ egrep "bind_ip|replSet" /etc/mongod.conf
bind_ip=127.0.0.1,10.68.64.138
#replSet=rs01
  • Start up the mongod (now as a standalone) instance with the newly edited configuration file
  • Drop into a mongo shell. You can then verify the consistency of the copied data (in my example I am using a synthetic database created by the YCSB benchmark software) …
[mongod@mongodb04 ~]$ mongo
MongoDB shell version: 3.0.3
connecting to: test
>
> show dbs
enron_mail 3.952GB
local 0.078GB
ycsb 207.853GB
>
> use ycsb
switched to db ycsb
>
> show collections
system.indexes
usertable
>
> db.usertable.findOne()
{
 "_id" : "user6284781860667377211",
 "field5" : BinData(0,"PTo4JCo4JiskPDU2PiIzPikmMycrNSQjJzo3NiEqJjImIyY0PSE5KS43ICM4Liw+MCclLyE1JCA3PDM3OCsoMz0nKD8rISYrKCw8Pik4JT0jIyYuKDkoOy4gJCkoPSMkOywnNA=="),
 "field4" : BinData(0,"PyI/NSIqITEhKzQxPTwlNyIzJy0zLz05IzUvLDkwLi0uNjQwKysnMSQ5Lj8hMDozPDo+JTg/IDw6NTI9PzAkNjIjLTw8OyYvOTsgLTM2IjcvPCk4Ij84Ny45LiYhJC0mKT0qKw=="),
 "field3" : BinData(0,"MTkwLj0sPTYtPTg7OTUmIz0xJSM3OzM+Kyw3PyckPDYnMyotMC84JDIjMzEnITktJi4mMz0rLTkzODkoPCUkMT4pJzEpKiM8JDUlKjUxLiY5PzIxIiM0OCgpLTUpLyQ1JCogLw=="),
 "field2" : BinData(0,"Pz4jNTcnLDgwKS04ISc7Oi01IDkrNDclPTsuMyElOSU/KywvMzY6PDU0Jyo3NyYjKig8NzMrOTs7LiQiMy0hKioiPiMkPTgyLDAyPSoxJS4tOiUyLzE9Pz8rJy0zPjo2PCszNA=="),
 "field9" : BinData(0,"KTknMSs2PyEiMjUiPz4qKiM9MS0iIj4vMyMyNCsjPSUkJyYsJz0nLD0/NSY3MyQ7KCo/OikiNyU3MjQ8MCArLj8wMyM0MiAgJiUzOCc5KjYiJTw8Izw+OzA7NCAiPzA5ISYxOg=="),
 "field8" : BinData(0,"KCgoLTQ4ICE0IDY0ODAyIS86LiYrKzQmIyIpJy0nKz0gICYpMTQsLT86PyAiKjs/Kzw3MSciJDYkJTssLiMsLiUtLzMtOiYpND8nMCQpOzopMyQ1OyYsNy8uPysxNyg1PCQ6MA=="),
 "field7" : BinData(0,"ITM1KiUmKCcnKjA8IT8xPCIjKzc4Ijs+IyYhICMlKCsjPy09MygxKTohPDgjLjIzMzwpPj45Oyg2JTgsLjYlJS0xPCc+IiolKDw0JiQkLyg4NiohID0xPjIyMic9KDgqKD0zKg=="),
 "field6" : BinData(0,"IjovMTotODUxPD0mPikqLyo4IzonKjc+LDwuLCEhICghOzYxJTw5PiAzIzktNjA9OzUqNiovMTszMDMqLDQmNjk8IDA0NjAvPz8mPygnITU+OSY8JCIzLDAgOSEmNyA1KCQ+Jw=="),
 "field1" : BinData(0,"ID0sMjg8JiAiKz4gITYjNCkxLSs+IC8/JTAiNCA8PignMTEjNyogKyU0Lz4lNz0mIzohNCQ8LSwhICw/OT82PCYyPjk2KSElIDo5MSU2Kig9NiUnOjwpLi0vOjIqJz0sKDExOw=="),
 "field0" : BinData(0,"OikgJykuOTchIjUlISghLCkwOis/Pyw4ISo7Jyw6Kz46NTwlODIsKDQqLT0uNi0rOTcsOzA4Lzo3PDcyMjozLiItKSY1JSsoLSIkPio7KC4xOzs7MzY7Oj0nOzU+IyEuJDMwJA==")
}

The VM can now be easily migrated anywhere across the Nutanix cluster using the features of the Nutanix Distributed Filesystem (NDFS). If additional availability is required, we can clone "blank" gold image VMs and form a new replica set. To avoid confusion, the naming convention of the new replica set(s) can reflect the intended use case: dev, test, QA etc.
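As an illustration, the same acli migrate workflow shown earlier would move the clone onto a specific AHV host (the host address here is purely illustrative):

<acropolis> vm.migrate mongodb04 host=10.68.64.41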

Using Nutanix clones to deploy MongoDB replica set

In this post I am going to look at setting up a replica set to support high availability in a MongoDB environment. Replica sets contain a primary MongoDB database and a number of additional secondary replica databases. Any one of the allowed replicas can become primary in the event that the original primary fails for whatever reason. Replica set membership count is usually an odd number in order that new primary elections are not tied.

Building out an HA MongoDB setup on Nutanix is relatively easy to do. Each MongoDB instance is hosted in a separate, sandboxed environment, in our case a virtual machine (VM), and each VM is located on a separate physical hypervisor host. I have a gold image VM with a MongoDB instance installed along recommended best practice guidelines; this VM gets cloned as required when I need to build out a new MongoDB environment. So for a three-member replica set I need three clones.

three-replicaset

From a cluster CVM node type:

$ acli 

<acropolis> vm.clone mongodb01,mongodb02,mongodb03 clone_from_vm=mongodb30-gold
mongodb01: complete
mongodb02: complete
mongodb03: complete
 
<acropolis> vm.list
...
mongodb01: 2b9498c1-502e-454e-93c8-931a45a321b6
mongodb02: 9a445d26-caf9-4ddf-9d8e-296ea8b6e19e
mongodb03: 9a5512fa-3d19-4ddc-8cac-11721f999459
...

<acropolis> vm.on mongodb01,mongodb02,mongodb03
mongodb01: complete
mongodb02: complete
mongodb03: complete

After powering on the VMs, check that mongod starts correctly on the default port 27017 on each VM. The first thing to make sure of is that the mongod process is listening on the correct address; I have set my VMs to use DHCP, and this is the address the service needs to listen on.

# ip a

2: eth0: <broadcast,multicast,up,lower_up> mtu 1500 qdisc pfifo_fast state UP qlen 1000
 link/ether 52:54:00:db:17:76 brd ff:ff:ff:ff:ff:ff
 inet 10.68.64.111/24 brd 10.68.64.255 scope global eth0


# cat /etc/mongod.conf | grep -i bind_ip
 bind_ip=127.0.0.1,10.68.64.111

# service mongod restart
# service mongod status

Once all of the VMs are up and running on their respective address:port tuples, we need to enable firewall access via iptables. Each VM that will form part of the replica set needs to allow access from the other members on the mongod port, 27017. So for a replica set with members 10.68.64.111, 10.68.64.114 and 10.68.64.113, run the following on each member (in this example 10.68.64.111)…

# iptables -A INPUT -s 10.68.64.113 -p tcp --destination-port 27017 -m state \
--state NEW,ESTABLISHED -j ACCEPT
# iptables -A INPUT -s 10.68.64.114 -p tcp --destination-port 27017 -m state \
--state NEW,ESTABLISHED -j ACCEPT

# service iptables save
iptables: Saving firewall rules to /etc/sysconfig/iptables:[ OK ]
# service iptables reload

abridged iptables -L output after the above changes….

Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT tcp -- 10.68.64.113 anywhere tcp dpt:27017 state NEW,ESTABLISHED
ACCEPT tcp -- 10.68.64.114 anywhere tcp dpt:27017 state NEW,ESTABLISHED

Check access by performing a series of bi-directional tests between all the replica set members:

<10.68.64.111>$ mongo --host 10.68.64.113 --port 27017
MongoDB shell version: 3.0.3
connecting to: 10.68.64.113:27017/test
>
> quit()

Should any of the connection tests fail then revisit the iptables entries. Usual troubleshooting applies with telnet or nc, netstat etc.

In order to create the replica set, connect via ssh to each VM and edit the mongod.conf to include the replSet functionality:

$ grep -i replSet /etc/mongod.conf
replSet=rs01

Restart the mongod process (sudo service mongod restart) and then start a mongo shell session; on the first member of the set (the intended primary) run:

$ mongo
MongoDB shell version: 3.0.3
connecting to: test
> rs.initiate()
{
 "info2" : "no configuration explicitly specified -- making one",
 "me" : "10.68.64.111:27017",
 "ok" : 1
}
rs01:PRIMARY>

You can use the shell commands rs.conf() and rs.status() to check the replica set at any point. We’ll look at one of these outputs after completing the replica set creation. Next, from the same mongo shell session, add the other two replica nodes:

rs01:PRIMARY> rs.add("10.68.64.113")
{ "ok" : 1 }

rs01:PRIMARY> rs.add("10.68.64.114")
{ "ok" : 1 }

Potential error scenarios

  • if you didn't clone the VMs for the replica set from a blank gold image, but rather from a VM already running a replicated MongoDB configuration, then the replication commands report errors similar to this:
{
 "info2" : "no configuration explicitly specified -- making one",
 "me" : "10.68.64.111:27017",
 "info" : "try querying local.system.replset to see current configuration",
 "ok" : 0,
 "errmsg" : "already initialized",
 "code" : 23
}

Provided this is a greenfield install, delete the local database files in the data directory and re-run rs.initiate().

  • if the firewall rules are not set correctly then the following error message is thrown:
 "errmsg" : "Quorum check failed because not enough voting nodes responded; 
required 2 but only the following 1 voting nodes responded: 10.68.64.111:27017; 
the following nodes did not respond affirmatively: 
10.68.64.131:27017 failed with Failed attempt to connect to 10.68.64.131:27017; 
couldn't connect to server 10.68.64.131:27017 (10.68.64.131), 
connection attempt failed",

Ensure that the firewall rules allow proper access between the VMs.

  • if replication is not enabled correctly in the mongod configuration files on each host of the replica set :
"errmsg" : "Quorum check failed because not enough voting nodes responded; 
required 2 but only the following 1 voting nodes responded: 10.68.64.110:27017; 
the following nodes did not respond affirmatively: 
10.68.64.114:27017 failed with not running with --replSet",

Once the replica set configuration is complete, check the setup by running rs.status() or rs.conf() to confirm :

rs01:PRIMARY> rs.conf()
{
 "_id" : "rs01",
 "version" : 3,
 "members" : [
 {
 "_id" : 0,
 "host" : "10.68.64.111:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {

 },
 "slaveDelay" : 0,
 "votes" : 1
 },
 {
 "_id" : 1,
 "host" : "10.68.64.113:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {

 },
 "slaveDelay" : 0,
 "votes" : 1
 },
 {
 "_id" : 2,
 "host" : "10.68.64.114:27017",
 "arbiterOnly" : false,
 "buildIndexes" : true,
 "hidden" : false,
 "priority" : 1,
 "tags" : {

 },
 "slaveDelay" : 0,
 "votes" : 1
 }
 ],
 "settings" : {
 "chainingAllowed" : true,
 "heartbeatTimeoutSecs" : 10,
 "getLastErrorModes" : {

 },
 "getLastErrorDefaults" : {
 "w" : 1,
 "wtimeout" : 0
 }
 }
}

From the output above we can see the full replica set membership, including each member's function and status: things like priority settings, whether or not the replica is hidden from user application queries, whether a replica is a full mongod instance or an arbiter (simply there to prevent tied primary elections), and whether any of the replicas have a delay enabled (used for backup/reporting duties).

In an earlier post I showed the mongo shell commands available to calculate the working set for the database. For read-intensive workloads, where your working set is sized to fit the available RAM in the mongod server VMs, a replica set deployment can be used to run MongoDB and support high availability.
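As a very rough first pass, you can compare the data and index sizes reported by db.stats() for your busiest database against the RAM available to the VM; a quick sketch (not a substitute for the working set calculation covered in that earlier post):

$ mongo ycsb --quiet --eval 'var s = db.stats(); printjson({dataSize: s.dataSize, indexSize: s.indexSize})'
$ free -m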