Elasticsearch Sizing on Nutanix

One node, one index, one shard

The answer to the question : “how big should I size my Elasticsearch VMs and how what kind of performance will I get?”, always comes down to the somewhat disappointing answer of “It depends!?” It depends on the workload – be it index or search heavy, on the type of data being transformed and so on. 

The way to size your Elasticsearch environment is by finding your “unit of scale”, this is the performance characteristics you will get for your workload via a single shard index running in a single Virtual Machine (VM). Once you have a set of numbers for a particular VM config then you can scale throughput etc, via increasing the number of VMs and/or indexes to handle additional workload.

Virtual Machine Settings

The accepted sweet spot for VM sizing an indexing workload is something like 64GB RAM/ 8+ vCPUs. You can of course right size this further where necessary, thanks to virtualisation. I assign just below half the RAM (31GB) to the heap for the Elasticsearch instance. This is to ensure that the JVM uses compressed Ordinary Object Pointers (OOPs) on a 64 bit system. This heap memory also needs to be locked into RAM

# grep -v ^# /etc/elasticsearch/elasticsearch.yml

cluster.name: esrally
node.name: esbench

path.data: /elastic/data01    # <<< single striped data volume 
bootstrap.memory_lock: true   # <<< lock heap in RAM
http.port: 9200
discovery.zen.minimum_master_nodes: 1  # <<< single node test cluster
xpack.security.enabled: false

# grep -v ^# /etc/elasticsearch/jvm.options

From the section above , notice the single mount point for the path.data entry. I am using a 6 vdisk LVM stripe. While you can specify per-vdisk mount points in a comma separated list, unless you have enough indices to make sure all the spindles turn (all the time) then you are better off with logical volume management. You can ensure you are using compressed OOPs by checking for the following log entry at startup

[2017-08-07T11:06:16,849][INFO ][o.e.e.NodeEnvironment ] [esrally02] heap size [30.9gb], compressed ordinary object pointers [true]

Operation System Settings

Set the required kernel settings 

# sysctl -p 
vm.swappiness = 0
vm.overcommit_memory = 0
vm.max_map_count = 262144

Ensure file descriptors limits are increased

# ulimit –n 65536


curl –XGET**.max_file_descriptors

Disable swapping, either via the cli or remove swap entries from /etc/fstab

# sudo swapoff –a 

Elasticsearch Bulk Index Tuning

In order to improve indexing rate and increase shard segment size, you can disable refresh interval on an initial load.  Afterwards, setting this to 30s (default=1s) in production means larger segments sizes and potentially less merge pressure at a later date.

curl -X PUT "" -H 'Content-Type: application/json' -d'
    "index" : {
        "refresh_interval" : "-1"

Recall that we only want a single shard index and no replication for our testing. We can achieve this by either disabling replication on the fly or creating a template that configures the desired settings at index creation 

Disable replication globally ...

curl -X PUT "" -H 'Content-Type: application/json' -d '{"index" : {"number_of_replicas" : 0}}’

or create a template - in this case, for a series of index name regex patterns...

# cat template.json
        “index_patterns": [ “ray*”, "elasticlogs”],
        "settings": {
                "number_of_shards": 1,
                "number_of_replicas": 0
curl -s -X PUT "" -H 'Content-Type: application/json' -d @template.json

Elasticsearch Benchmarking tools

esrally is a macrobenchmarking tool for elasticsearch. To install and configure – use the following quickstart guide. Full information is available here :


rally-eventdata-track –  is repository containing a Rally track for simulating event-based data use-cases. The track supports bulk indexing of auto-generated events as well as simulated Kibana queries.


esrally --pipeline=benchmark-only --target-hosts= 
--track=eventdata --track-repository=eventdata --challenge=bulk-size-evaluation
eventdata bulk index - 5000 events/request highlighted @indexing rate of ~50k docs/sec
eventdata bulk index – 5000 events/request highlighted @indexing rate of ~50k docs/sec
httpd logs index test - highlighted @indexing rate ~80k docs/s
httpd logs index test – highlighted @indexing rate ~80k docs/s

Elasticsearch is just one of a great many cloud native applications that can run successfully on Nutanix Enterprise Cloud. I am seeing more and more opportunities to assist our account teams in the sizing and deployment of Elasticsearch. However, unlike other Search and Analytics platforms Elasticsearch has no ready made formula for sizing. This post will hopefully allow people to make a start on their Elasticsearch sizing on Nutanix and, in addition, help identify future steps to improve their performance numbers.

Further Reading

Elasticsearch Reference

Openstack + Nutanix : Nova and Cinder integration

Now that we have setup an allinone deployment of the Acropolis OVM, configured networking, and an image registry. It’s time to look at the steps required to launch virtual machine (VM) instances and setup appropriate storage.  The first steps to take are to provide the necessary network access rules for the VM’s if they don’t already exist. The easiest way to do this is to create rules to ensure SSH (port 22) access from any address range and to make the VMs pingable.

Compute > Access & Security > Security Groups

Compute > Access & Security > Security Groups

Compute > Access-Security > Security Groups

Compute > Access & Security > Security Groups

Next create an SSH key-pair that can be assigned to your instances and subsequently control VM remote login access to holders of the appropriate private key. I will show how this is used later in the post, when we launch an instance. First, select the Key Pairs tab in the Access & Security frame and save the resulting PEM file to be used when accessing your VMs.


Create a named key-pair (for example fedora-kp) for the set of instances you will create.

As an example, I am going to create a single volume using the Cinder service, in order to show we can attach this to a running VM. In this instance, Cinder gets redirected to the Acropolis Volume API and the subsequent volume gets attached to the instance as an iSCSI block device.


Next step will be to spin up a number of VM instances, I have given a generic instance prefix for the name, and I am choosing to boot a Fedora 23 Cloud image. You can see the Flavour Details in the side panel in the screenshot below – Note the root disk size is big enough to accommodate the base image.


I also need to specify the SSH key-pair I am using and the Network on which the instances get launched. See below :



At this point I can go ahead and launch my instances. We can see the 10 instances chosen all get created below, along with the assigned IP addresses from the already defined network, the instance flavour, and the named key-pair ….


So now, if we were to take a look at the Nutanix cluster backend via Prism, we can see those VM instances created on the cluster and how they are spread across the hypervisor hosts. That’s all down to Acropolis management and placement.


We can dig a little deeper into the Acropolis functionality and show how each of the steps taken by the Acropolis REST API calls have built and deployed the VMs on the backend. Here’s the list of VMs that were created as defined in the http://<CVM-IP>:2030 page.


And we can see the breakdown of the individual task steps and how long each one took and how long they might have queued for, and if they were ultimately successful and so on. The key take away from all this is that the speed of creation of the VM instances is largely down to the Acropolis management interfaces consumed by the REST API calls.


Let’s take one of those VMs and add some volumes to it, let’s add a data and a log volume to fedvm-10. First of all we need to create the iSCSI volumes



Then we can attach the volumes to the VM instance ….


We now have the two volumes attached to the VM ….


The two volumes should show up as virtual disks under /dev in the VM itself. We can verify this by logging into the VM directly using the private key I created earlier as part of the key-pair assigned to this series of instances.

# ssh -i ./fedora-kp.pem fedora@
Last login: Thu Apr 7 21:28:21 2016 from
[fedora@fedvm-10 ~]$ 

[fedora@fedvm-10 ~]$ sudo fdisk -l
Disk /dev/sda: 3 GiB, 3221225472 bytes, 6291456 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x6e3892a8

Device Boot Start End Sectors Size Id Type
/dev/sda1 * 2048 6291455 6289408 3G 83 Linux

Disk /dev/sdb: 10 GiB, 10737418240 bytes, 20971520 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/sdc: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

So from here, we can format the newly assigned disks and mount them as needed.

That’s it for this post, hopefully this series of posts has gone a little way to clarify how a Nutanix cluster can be used to scale out an Openstack deployment to form a highly available on-premise cloud. The deployment of which is radically simplified by using Nutanix as the Compute, Volume, Image and Network backend.

In future posts I intend to look at deploying an upstream Openstack controller, have a play around with snapshots within Openstack and their use as images. Also, some additional troubleshooting perhaps. Let me know what you find useful.

Openstack + Nutanix: Glance Image Service

This post will cover the retrieval of base or cloud OS images via the Openstack Glance image service and how the Acropolis driver interacts with Glance and maintains the image data on the Nutanix Distributed Storage Fabric (DSF).

From the Openstack documentation:

  • The Glance image service includes discovering, registering and retrieving virtual machine images
  • has a RESTful API  – allows querying of image metadata and actual image retrieval
  • It has the ability to copy (or snapshot) a server image and then to store it promptly. Stored images then can be used as templates to get new servers up and running quickly, and can also be used to store and catalog unlimited backups.

The Acropolis driver interacts with the Glance service by redirecting an image from the Openstack controller to the Acropolis DSF. Aside from any image metadata (ie: image configuration details) being stored in Glance, the image itself is actually stored on the Nutanix cluster. We do not store any images in the OVM, either in the Glance store or anywhere else within the Openstack Controller.

Images are managed in Openstack via System > Images – see screenshot below for example list of available images in an Openstack environment


Images in Openstack are mostly downloaded via HTTP URL. Though file upload does work.  The image creation workflow in the screenshot below shows a Fedora 23 Cloud image in QCOW2 format being retrieved. I have left the respective “Minimum Disk” (size) and “Minimum RAM” fields blank – so that no minimum is set for either.


You can confirm the images are loaded into the Nutanix Cluster backend by viewing the Image Configuration menu in Prism. The images in Prism are stored on a specific container in my case.


Similarly, Prism will report the progress of the Image upload to the cluster through the event and progress monitoring facility on the main menu bar.


If all you really need is a quick demo perhaps, then Openstack suggests the following OS image for test purposes. Use this simply to test and demonstrate the basic glance functionality via the command line. Works exactly the same if done via Horizon GUI, however.

[root@nx-ovm ~]# source keystonerc_admin
[root@nx-ovm ~(keystone_admin)]# glance image-create --name cirros-0.3.2-x86_64 \
--is-public true --container-format bare --disk-format qcow2 \
--copy-from http://download.cirros-cloud.net/0.3.2/cirros-0.3.2-x86_64-disk.img

| Property         | Value                                |
| checksum         | None                                 |
| container_format | bare                                 |
| created_at       | 2016-04-05T10:31:53.000000           |
| deleted          | False                                |
| deleted_at       | None                                 |
| disk_format      | qcow2                                |
| id               | f51ab65b-b7a5-4da1-92d9-8f0042af8762 |
| is_public        | True                                 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | cirros-0.3.2-x86_64                  | 
| owner            | 529638a186034e5daa11dd831cd1c863     |
| protected        | False                                |
| size             | 0                                    |
| status           | queued                               |
| updated_at       | 2016-04-05T10:31:53.000000           |
| virtual_size     | None                                 |

This is then reflected in the glance image list

[root@nx-ovm ~(keystone_admin)]# glance image-list
| ID                                   | Name                       | Disk Format | Container Format | Size       | Status |
| 44b4c9ab-b436-4b0c-ac8d-97acbabbbe60 | CentOS 7 x86_84            | qcow2       | bare             | 8589934592 | active |
| 033f24a3-b709-460a-ab01-f54e87e0e25b | cirros-0.3.2-x86_64        | qcow2       | bare             | 41126400   | active |
| f9b455b2-6fba-46d2-84d4-bb5cfceacdc7 | Fedora 23 Cloud            | qcow2       | bare             | 234363392  | active |
| 13992521-f555-4e6b-852b-20c385648947 | Ubuntu 14.04 - Cloud Image | qcow2       | bare             | 2361393152 | active |

One other thing to be aware of is that all network, image, instance and volume manipulation should only be done via the Openstack dashboard. All the Openstack elements created this way can not subsequently be changed or edited with the Acropolis Prism GUI. Both management interfaces are independent of one another. In fact the Openstack Services VM (OVM) was intentionally designed this way to be completely stateless. Though obviously this could change in future product iterations, if it was deemed to be a better solution going forward.

I have included the Openstack docs URL with additional image locations for anyone wanting to pull images of their own to work with. This is an excellent reference location for potential cloud instance images for both Linux distros and Windows:


Next up, we will have a look at using the Acropolis Cinder plugin for Block Storage and the Nova Compute service integration.

Openstack + Nutanix : Neutron Networking

In my last post we covered the all-in-one installation of the OpenStack controller with the Nutanix shipped Acropolis Openstack drivers. The install created a single virtual machine, the Openstack Services VM (OVM). In this post I intend to talk about setting up a Network Topology using the Openstack dashboard and the Neutron service integration with Nutanix. I will be able to show how this gets reflected in the Acropolis Prism GUI. First, let’s create a public network for our VMs to reside on. Navigate via the Horizon dashboard to Admin > Networks

Navigate to System > Networks on the Horizon dashboard

1. Navigate to System > Networks on the Horizon dashboard and select +Create Network

Create Network

Currently, only local and VLAN “Provider Network Types” are supported by the Nutanix Openstack drivers. In the screenshot below, I am creating a segmented network (ID 64), named public-network, in the default admin tenant. I specify the network as shared and external.

Pro Tip

Do not use a VLAN/network assignment that has already been defined within the Nutanix cluster. Any network/subnet assignment should be done within Openstack using network parameters reserved specifically for your Openstack deployments.



2. Add your network details – Name, Project (tenant), Network Type, VLAN ID, etc

Create Subnet

Each network needs to have a subnet created with an associated DHCP pool. This DHCP pool information gets sent via the appropriate API call to Acropolis. Acropolis management associates the IP address with the vnic on the Acropolis VM. The Acropolis Openstack driver reads this configuration and, when the cloud instance gets powered on in Openstack, it will register the IP address with the Openstack VM. See setup screenshots below…

Pro Tip

When creating a subnet, you must specify a DNS server.




3. Subnet creation – requires subnet name,  network address (CIDR notation), gateway address, DHCP pool range, DNS servers

We can see the newly created network reflected in Acropolis via the Prism GUI in the screenshot below:



Pro Tip

Once a network has been configured and you decide to add an additional cluster. That network will not be extended across the new cluster. You have a choice, you can either add a new network. Or, you can remove the network and re-add it so that it gets created across all the currently configured clusters.


Now that we have a network configured we can look at setting up cloud instances to run on it. To do that we need to set up the Glance Image Service and that’s the subject of my next post.

Additional reading



Openstack + Nutanix Integration : A Configuration Primer

As of Acropolis Base Software (NOS) version 4.6, Nutanix released a set of Acropolis drivers that provide Openstack + Nutanix integration. These drivers allow an Openstack deployment to consume the Acropolis management infrastructure in a similar way to a cloud service or within a datacenter. I intend to use this series of blog postings to cover a walk-through of setting up the Nutanix Openstack drivers deployment and configuring cloud instances.

The integration stack works by having the Openstack controller installed in a separate Nutanix Openstack Services VM (Nutanix OVM). The Acropolis drivers can be installed into that same OVM. These drivers then interpose on the Openstack services for compute, image, network and volume. By subsequently translating Openstack requests, into the appropriate REST API calls in Acropolis management layer, a series of Nutanix clusters are then managed by the Openstack controller.

Openstack Acropolis integrated stack

Openstack – Acropolis Driver integrated stack

The Acropolis drivers are installed in either one of :

All-In-One Mode: You use the OpenStack controller included in the Nutanix OVM to manage the Nutanix clusters. The Nutanix OVM runs all the OpenStack services and the Acropolis OpenStack drivers.

Driver-Only Mode: You use a remote (or upstream) OpenStack controller  to manage the Nutanix clusters, and the Nutanix OVM includes only the Acropolis OpenStack drivers.

In either case, Nutanix currently only supports the Kilo release of Openstack.

I will go into further detail around Openstack and Acropolis architecture integration in future posts. For now let’s start by getting things set up. First requirement is to download the OVM image – from the Nutanix Portal – and then add it to the Acropolis Image Service….

$ wget http://download.nutanix.com/nutanix-open-stack/nutanix_openstack-2015.1.0-1.ovm.qcow2

and upload locally....

<acropolis> image.create ovm source_url=nfs://freenas/naspool/openstack/nutanix_openstack-2015.1.0-1.ovm.qcow2 container=Image-Store

Also, Prism allows upload from your desktop if preferred/possible
or, go direct via the internet...
<acropolis> image.create ovm source_url=http://download.nutanix.com/nutanix-open-stack/nutanix_openstack-2015.1.0-1.ovm.qcow2 container=Image-Store

Note : As I had already created several containers on my cluster, I needed to specify the name of the preferred container in the above syntax . Otherwise, the container name of default is expected, if only one container exists and no container name is supplied. The following error is shown otherwise…

 kInvalidArgument: Multiple containers have been created, cannot auto select

Create the Openstack Services VM (OVM) – using Acropolis command line on a CVM on the Nutanix cluster. This can all be done very easily via the Prism GUI but for reasons of space I am going the CLI route. Refer to the Install Guide on the Nutanix Portal. Select Downloads > Tools & Firmware from the menus/drop-downs

<acropolis> vm.create nx-ovm num_vcpus=2 memory=16G
nx-ovm: complete
<acropolis> vm.disk_create nx-ovm clone_from_image=ovm
DiskCreate: complete
<acropolis> vm.nic_create nx-ovm network=vlan.64
NicCreate: complete
<acropolis> vm.on nx-ovm

Note: if you are unfamiliar with creating a network for your VMs to reside on then take a look here  , where I discuss setting up VMs and associated disks and networking on the Nutanix platform.

For now let’s consider the all-in-one install mode , there are just three steps….

o Login to the VM using the supplied credentials (via ssh)

o Add the OVM
[root@nx-ovm]#  ovmctl --add ovm --name nx-ovm --ip --netmask --gateway --nameserver --domain nutanix.com

o Add the Openstack Controller 
[root@nx-ovm]# ovmctl --add controller --name kilo --ip

o Add the Nutanix clusters that you want to manage with OpenStack, 
one cluster at a time.
[root@nx-ovm]# ovmctl --add cluster --name SAFC --ip --username admin --password nutanix/4u --container_name DEFAULT-CTR --num_vcpus_per_core 4

Just to point out one or two things in the above CLI. The IP address for both OVM and Openstack controller are the same – should make sense as they are both part of  the same VM in this case. Also, use the Cluster Virtual IP when registering the cluster. This is for HA reasons. Rather than use the IP of an individual CVM, use the failover IP for the cluster itself.  You need to specify a container name if it is not default (or there are more than one).

Pro Tip


If you remove or rename the container with which you added a Nutanix cluster, you must restart the services on the OpenStack Controller VM by running the following command:

ovmctl --restart services


You can now verify the base install using….

[root@nx-ovm ~]# ovmctl --show

Allinone - Openstack controller, Acropolis drivers

OVM configuration:
1 OVM name : nx-ovm
 IP :
 Netmask :
 Gateway :
 Nameserver :
 Domain : nutanix.com

Openstack Controllers configuration:
1 Controller name : kilo
 IP :
 Auth strategy : keystone
 Auth region : RegionOne
 Auth tenant : services
 Auth Nova password : ********
 Auth Glance password : ********
 Auth Cinder password : ********
 Auth Neutron password : ********
 DB Nova : mysql
 DB Cinder : mysql
 DB Glance : mysql
 DB Neutron : mysql
 DB Nova password : ********
 DB Glance password : ********
 DB Cinder password : ********
 DB Neutron password : ********
 RPC backend : rabbit
 RPC username : guest
 RPC password : ********
 Image cache : disable

Nutanix Clusters configuration:
1 Cluster name : SAFC
 IP :
 Username : admin
 Password : ********
 Vnc : 49795
 Vcpus per core : 4
 Container name : DEFAULT-CTR
 Services enabled : compute, volume, network

Version : 2015.1.0
Release : 1
Summary : Acropolis drivers for Openstack Kilo.

Additionally, pointing your browser at the IP address of the OVM – and navigating to Admin > System Information and selecting the Services Tab, you should see that all Openstack services are provided by the OVM IP address. Similarly, under the Compute, Block Storage and Network tabs it should also report that these services are being provided via the OVM


One other check would be to look at Admin > Hypervisors, a Nutanix cluster reports as a single hypervisor in the Openstack config – see below:


I hope  this is enough to get people started looking at and trying out Openstack deployments using Nutanix. In the next series of posts I will look at configuring images, setting up networks/subnets and then move on to creating instances.

Additional info / reading

Nutanix: Cloud-like DevOps powering NoSQL for BigData

The popularity of NoSQL has increasingly come about as developers want to use the same in-memory data structures in their applications and have them map directly into a database persistence layer. For example, storing data in XML or JSON format is often hierarchical and potentially does not lend itself to being easily stored in row based tables. It becomes more complicated if the data also contains lists and objects. Not having to convert these in-memory structures into relational database structures is a major advantage in terms of time to value. Such considerations have been made all the more acute by the rise of the web as a platform for services. There’s also an economic aspect, like the prohibitive infrastructure costs required to scale up traditional RDBMS to support high availability etc. Compare this to such Web-Scale or cloud aware apps like NoSQL, which expects to “just drop in” commodity hardware at the infrastructure layer and scale out horizontally on demand.

So if we were to consider the requirements from a modern hyper-converged infrastructure (HCI) that employed the same Web-Scale paradigms used by modern cloud-aware applications. Then to deploy apps, like a NoSQL database for example, the first thing I would want to do is virtualise. This means a right-sized, sandboxed environment (ie a virtual machine) to run individual NoSQL instances. If there was a need to scale up, then it’s a simple case of increasing RAM and CPU. As the application landscape grows over time and starts to scale out, there’s increased need for more nodes/VMs.  Hence, any HCI platform needs cloud like provisioning of nodes. So providing faster time to deploy and time to value. The ability to auto-discover and add new nodes by the click of a button is quite compelling. In short, horizontal scale out needs to be easily undertaken. Say, in the middle of the production day, while running the month end workload?

Intelligent, automated data tiering, locality and balancing via post-process techniques like Mapreduce is another key requirement. As any database working set grows over time, ie: more users will mean more queries, new tables, indexes, aggregations, etc. So the ability to maintain a responsive I/O profile via SSD, as more I/O is periodically obtained from disk, will be key. If all VMs are then able to get local access to their data via SSD from a global storage fabric so much the better. While we are here, consider how you would migrate to a new(er) hardware fleet with and without a distributed storage fabric. Far easier to just drop in units of converged compute/storage and then migrate VMs to it. Compare how that would work with a large white box server estate spread across numerous racks in a DC? There’s yet another aspect of economics to all this. In that auto tiering of the storage layer means the current “working set” data is held at the most performant (and by comparison more expensive) layer. While colder data sits on cheaper spinning disk.

Another advantage of a distributed storage fabric is one of data service features. Take point in time (PIT) backups of sharded DBs, which can sometimes be a complicated issue. In which case, a data service that supports VM centric snapshots of key VMs in a consistency group can avoid another potential pain point. Also, rapid cloning of preconfigured VMs will improve deployment times and speaks to the DevOps workflows that many IT shops have increasingly adopted. Consider how easy might it be to create dev/QA environments with production style data using such mechanisms? What about burst workloads? The ability to migrate VMs between public and private cloud would bring further benefits, both as a means to provide offsite backups or move VMs between geographies.

Bear in mind there isn’t 20+ years of ecosystem software (or even tribal knowledge perhaps?) in the NoSQL community – unlike in traditional RDBMS. For this reason continual monitoring is a major requirement. The ability to support a floor to ceiling overview of VMs, hypervisor and hardware platform in terms of performance, alerts and events is paramount. We mentioned briefly above how working set size and IO throughput could affect end user experience. So the ability to predict trends in such behaviour means timely decisions about when to scale or shard an application can be made.  No discussion of any DevOps processes is complete without including REST API and/or Powershell automation capabilities. Automation is key in terms of workflow agility, allowing routine tasks to be performed repeatedly with a well understood outcome. Dev/QA environments can benefit greatly from the features already described. In addition, via the API, developers can build self-service portal software allowing them to spin up new environments in a matter of minutes.

In previous roles I worked with customers running UNIX based failover clusters protecting traditional SQL RDBMS and ERP software. Think Solaris and SUN Cluster, underpinning Oracle and SAP installs.  While running this kind of ‘Big Iron’ was considered ‘state of the art’. Coming up fast on the inside was ‘Big Data’ and with it a complete rethink on how to achieve massive scale. Traditionally, systems had scaled vertically by adding more CPU and RAM to the host platform, and horizontally by adding system boards to a midframe chassis. This came at a price and often a staggering level of administrative complexity. While Web-Scale technologies may not have completely replaced this approach yet, large scale big iron systems will continue to become more niche as time goes on in my opinion.

So, coming back to the beginning of this post. HCI is not only about scaling just to support Big Data workloads, it’s also about creating lower time to value and radical ease of use synergies with the application that sits on top of the stack. Having a HCI platform designed from the ground up with the same underlying principles as modern Web-Scale applications, means we are able to remove the operational delays and complexity that tend to act as drag anchors in today’s rapid deployment environments. IT departments are then free to focus on innovations that help the business succeed.

Configuring Docker Storage on Nutanix

I have recently been looking at how best to deploy a container ecosystem on my Nutanix XCP environment. At present I can run a number of containers in a virtual machine (VM). This will give me the required networking, persistent storage and the ability to migrate between hosts. Something that containers are only just becoming capable of in many cases. So that I can scale out my container deployment within my Docker host VM, I am going to have to consider increasing  the available space within /var/lib/docker. By default, if you provide no additional storage/disk for your Docker install, loopback files get created to store your containers/images. This configuration is not particularly performant, so it’s not supported for production.  You can see below how the default setup looks…

# docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
 Pool Name: docker-253:1-33883287-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.821 GB
 Data Space Total: 107.4 GB
 Data Space Available: 50.2 GB
 Metadata Space Used: 1.479 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.146 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-client

Configuring Docker Storage Options

Looking at the various methods we can use to provide dedicated block storage for Docker containers. With the Devicemapper storage driver Docker automatically creates a base thin device, this is two block devices, one for data and one for metadata. The thin device is automatically formatted with an empty filesystem on creation. This device is the base of all docker images and containers. All base images are snapshots of this device and those images are then in turn used as snapshots for other images and eventually containers. This is the Docker supported production setup. Also, by using LVM based devices as the underlying storage you are then accessing them as raw devices and no longer go through the VFS layer.

Devicemapper : direct-lvm

To begin, create two LVM devices, one for container data and another to hold metadata. By default the loopback method creates a storage pool with 100GB of space. In this example I am creating a 200G LVM volume for data and a 5G metadata volume. I prefer separate volumes where possible for performance reasons. We start by hot-adding the required Nutanix vDisks to the virtual machine (docker-directlvm) guest OS (Fedora22)

<acropolis> vm.disk_create docker-directlvm create_size=200g container=DEFAULT-CTR
DiskCreate: complete
<acropolis> vm.disk_create docker-directlvm create_size=10g container=DEFAULT-CTR
DiskCreate: complete

[root@docker-directlvm ~]# lsscsi 
[2:0:1:0] disk NUTANIX VDISK 0 /dev/sdb 
[2:0:2:0] disk NUTANIX VDISK 0 /dev/sdc 

The next step is to create the individual LVM volumes

# pvcreate /dev/sdb /dev/sdc
# vgcreate direct-lvm /dev/sdb /dev/sdc

# lvcreate --wipesignatures y -n data direct-lvm -l 95%VG
# lvcreate --wipesignatures y -n metadata direct-lvm -l 5%VG

If setting up a new metadata pool need to zero the first 4k to indicate empty metadata:
# dd if=/dev/zero of=/dev/direct-lvm/metadata bs=1M count=1

For sizing the metadata volume above, the rule of thumb seems to be 0.1% of the data volume. This is somewhat anecdotal, so size with a little headroom perhaps? Next, start the Docker daemon using the required options in the file /etc/sysconfig/docker-storage.

DOCKER_STORAGE_OPTIONS="--storage-opt dm.datadev=/dev/direct-lvm/data --storage-opt \
dm.metadatadev=/dev/direct-lvm/metadata --storage-opt dm.fs=xfs"

You can then verify that the requested underlying storage is in use, with the docker info command

# docker info
Containers: 5
Images: 2
Storage Driver: devicemapper
 Pool Name: docker-253:1-33883287-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/direct-lvm/data
 Metadata file: /dev/direct-lvm/metadata
 Data Space Used: 10.8 GB
 Data Space Total: 199.5 GB
 Data Space Available: 188.7 GB
 Metadata Space Used: 7.078 MB
 Metadata Space Total: 10.5 GB
 Metadata Space Available: 10.49 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-directlvm

All well and good so far, but the storage options to expose data and metadata locations namely dm.datadev and dm.metadatadev have been deprecated in favour of a preferred model, which is to have a thin pool reserved outside of Docker and passed to the daemon via the dm.thinpooldev storage option.  There’s a helper script in some Linux distros called  /etc/sysconfig/docker-storage-setup. This does all the heavy lifting, you just need to supply a device  and, or a volume group name.

Devicemapper: Thinpool

Once again start by creating the virtual device – a Nutanix vDisk –  and adding it to the virtual machine guest OS

<acropolis> vm.disk_create docker-thinp create_size=200g container=DEFAULT-CTR
DiskCreate: complete

[root@localhost sysconfig]# lsscsi
[0:0:0:0] cd/dvd QEMU QEMU DVD-ROM 1.5. /dev/sr0
[2:0:0:0] disk NUTANIX VDISK 0 /dev/sda
[2:0:1:0] disk NUTANIX VDISK 0 /dev/sdd

Edit the file /etc/sysconfig/docker-storage-setup as follows:

root@localhost sysconfig]# cat /etc/sysconfig/docker-storage-setup
# Edit this file to override any configuration options specified in
# /usr/lib/docker-storage-setup/docker-storage-setup.
# For more details refer to "man docker-storage-setup"

Then run the storage helper script:

[root@docker-thinp ~]# pvcreate /dev/sdd
 Physical volume "/dev/sdd" successfully created

[root@docker-thinp ~]# vgcreate docker /dev/sdd 
 Volume group "docker" successfully created 

[root@docker-thinp ~]# docker-storage-setup 
Rounding up size to full physical extent 192.00 MiB 
Logical volume "docker-poolmeta" created. 
Wiping xfs signature on /dev/docker/docker-pool. 
Logical volume "docker-pool" created. 
WARNING: Converting logical volume docker/docker-pool and docker/docker-poolmeta to pool's data and metadata volumes. 
Converted docker/docker-pool to thin pool. 
Logical volume "docker-pool" changed.

You can then verify the underlying storage being used in the usual way

[root@docker-thinp ~]# lsblk

sdd 8:48 0 186.3G 0 disk
├─docker-docker--pool_tmeta 253:5 0 192M 0 lvm
│ └─docker-docker--pool 253:7 0 74.4G 0 lvm
└─docker-docker--pool_tdata 253:6 0 74.4G 0 lvm
 └─docker-docker--pool 253:7 0 74.4G 0 lvm

[root@docker-thinp ~]# docker info
Containers: 0
Images: 0
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 62.39 MB
 Data Space Total: 79.92 GB
 Data Space Available: 79.86 GB
 Metadata Space Used: 90.11 kB
 Metadata Space Total: 201.3 MB
 Metadata Space Available: 201.2 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.5-201.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: docker-thinp

On completion this will have created the correct entries in /etc/sysconfig/docker-storage

[root@docker-thinp ~]# cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS=--storage-driver devicemapper --storage-opt dm.fs=xfs \
 --storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool

and runtime looks like...

[root@docker-thinp ~]# ps -ef | grep docker
root 8988 1 0 16:13 ? 00:00:11 /usr/bin/docker daemon --selinux-enabled
--storage-driver devicemapper --storage-opt dm.fs=xfs 
--storage-opt dm.thinpooldev=/dev/mapper/docker-docker--pool

Bear in mind that when you are changing the underlying docker storage driver or storage options similar to the examples described above.  Then, typically, the following destructive command sequence is run (be sure to backup any important data before running the following ….

$ sudo systemctl stop docker
$ sudo rm -rf /var/lib/docker 

post changes 

$ systemctl daemon-reload
$ systemctl start docker

Additional Info




ELK on Nutanix : Kibana

It might seem like I am doing things out of sequence by looking at the visualisation layer of the ELK stack next. However, recall in my original post , that I wanted to build sets  of unreplicated indexes and then use Logstash to fire test workloads at them. Hence, I am covering Elasticsearch and Kibana initially. This brings me to another technical point that I need to cover. In order for a single set of indexes to be actually recoverable, when running on a single node, we need to invoke the following parameters in our Elasticsearch playbook :

So in file: roles/elastic/vars/main.yml
elasticsearch_gateway.recover_after_nodes: 1
elasticsearch_gateway.recover_after_time: 5m
elasticsearch_gateway.expected_nodes: 1

These are then set in the elasticsearch.yml.j2 file as follows:

# file: roles/elastic/templates/elasticsearch.yml.j2
#{{ ansible_managed }}


# Allow recovery process after N nodes in a cluster are up:
#gateway.recover_after_nodes: 2
{% if elasticsearch_gateway_recover_after_nodes is defined %}
gateway.recover_after_nodes : {{ elasticsearch_gateway_recover_after_nodes}}
{% endif %}

and so on ....

This allows the indexes to be recovered when there is only a single node in the cluster. See below for the state of my indexes after a reboot:

[root@elkhost01 elasticsearch]# curl -XGET http://localhost:9200/_cluster/health?pretty
 "cluster_name" : "nx-elastic",
 "status" : "yellow",
 "timed_out" : false,
 "number_of_nodes" : 1,
 "number_of_data_nodes" : 1,
 "active_primary_shards" : 4,
 "active_shards" : 4,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 4,
 "delayed_unassigned_shards" : 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch" : 0

Lets now look at the Kibana playbook I am attempting. Unfortunately, Kibana is distributed as a compressed tar archive. This means that the yum or dnf modules are no help here. There is however a very useful unarchive module, but first we need to download the tar bundle using get_url as follows :

- name: download kibana tar file
 get_url: url=https://download.elasticsearch.org/kibana/kibana/kibana-{{ kibana_version }}-linux-x64.tar.gz
 dest=/tmp/kibana-{{ kibana_version }}-linux-x64.tar.gz mode=755
 tags: kibana

I initially tried unarchiving the Kibana bundle into /tmp. I then intended to copy everything below the version specific directory (/tmp/kibana-4.0.1-linux-x64) into the Ansible created /opt/kibana directory. This proved problematic as neither the synchronize nor the copy modules seemed setup to do mass copy/transfer between one directory structure to another. Maybe I am just not getting it – I even tried using with_item loops but no joy as fileglobs are not recursive. Answers on a postcard are always appreciated? In the end I just did this :

- name: create kibana directory
 become: true
 file: owner=kibana group=kibana path=/opt/kibana state=directory
 tags: kibana

- name: extract kibana tar file
 become: true
 unarchive: src=/tmp/kibana-{{ kibana_version }}-linux-x64.tar.gz dest=/opt/kibana copy=no
 tags: kibana

The next thing to do was to create a systemd service unit. There isn’t one for Kibana as there is no rpm package available. Usual templating applies here :

- name: install kibana as systemd service
 become: true
 template: src=kibana4.service.j2 dest=/etc/systemd/system/kibana4.service owner=root \
           group=root mode=0644
 - restart kibana
 tags: kibana

And the service unit file looked like:

[ansible@ansible-host01 templates]$ cat kibana4.service.j2
{{ ansible_managed }}

ExecStart=/opt/kibana/kibana-{{ kibana_version }}-linux-x64/bin/kibana


This all seemed to work as I could now access Kibana via my browser. No indexes yet of course :


There are one or two plays I would like still like to document. Firstly, the ‘notify’ actions in some of the plays. These are used to call – in my case – the restart handlers. Which in turn causes the service in question to be restarted – see the next section :

# file: roles/kibana/handlers

- name: restart kibana
 become: true
 service: name=kibana state=restarted

I wanted to document this next feature simply because it’s so useful – tags. I have assigned a tag to every play/task in the playbook so far you will have noticed. For testing purposes they allow you to run specific plays. You can then troubleshoot just that particular play and see what’s going on.

 ansible-playbook -i ./production site.yml --tags "kibana" --ask-sudo-pass

Now that I have the basic plays to get my Elasticsearch and Kibana services up and running via Ansible, it’s time to start looking at Logstash. Next time I post on ELK type stuff, I will try to look at logging and search use cases. Once I crack how they work of course.

ELK on Nutanix : Elasticsearch

In this second post on using Ansible to deploy the ELK stack on Nutanix, I will cover my initial draft at a playbook for Elasticsearch (ES).  Recall from my previous post, the playbook layout looks like:

[ansible@ansible-host01 roles]$ tree elastic
├── files
│   └── elasticsearch.repo
├── handlers
│   └── main.yml
├── tasks
│   └── main.yml
├── templates
│   ├── elasticsearch.default.j2
│   ├── elasticsearch.in.sh.j2
│   └── elasticsearch.yml.j2
└── vars
 └── main.yml

There’s also an additional role at play here, config – which is the basic config for the underlying VM guest OS, which we also need to look at :

[ansible@ansible-host01 roles]$ tree config
├── files
├── handlers
├── tasks
│   └── main.yml
├── templates
└── vars
 └── main.yml

the common role is where I set things via the Ansible sysctl module, or add entries to files (using lineinfile) in order to set max memory and ulimits etc. It’s generic system configuration, so for example:

#installing java runtime pkgs (pre-req for ELK)
- name: install java 8 runtime
 become: true
 yum: name=java state=installed
 tags: config

#set system max/min numbers...
- name: set maximum map count in sysctl/systemd
 become: true
 sysctl: name=vm.max_map_count value={{ os_max_map_count }} state=present
 tags: config


- name: set soft limits for open files
 become: true
 lineinfile: dest=/etc/security/limits.conf line="{{ elasticsearch_user }} soft nofile {{ elasticsearch_max_open_files }}" insertafter=EOF backup=yes
 tags: config

- name: set max locked memory
 become: true
 lineinfile: dest=/etc/security/limits.conf line="{{ elasticsearch_user }} - memlock {{ elasticsearch_max_locked_memory }}" insertafter=EOF backup=yes
 tags: config


Here might be a good time to touch upon how Ansible allows you to set variables. Within the directory of each role there’s a subdir called vars and all the variables needed for that role are contained in the YAML file (main.yml). Here’s a snippet:

# can use vars to set versioning and user 
elasticsearch_version: 1.7.0
elasticsearch_user: elasticsearch

# here's how we can specify the data volumes that ES will use 
elasticsearch_data_dir: /esdata/data01,/esdata/data02,/esdata/data03,/esdata/data04,/esdata/data05,/esdata/data06


# Virtual memory settings - ES heap is set to half my current VM RAM
# but no greater than 32GB for performance reasons
elasticsearch_heap_size: 16g
elasticsearch_max_locked_memory: unlimited
elasticsearch_memory_bootstrap_mlockall: "true"


# Good idea not to go with the ES default names of Franz Kafka etc
elasticsearch_cluster_name: nx-elastic
elasticsearch_node_name: nx-esnode01

# My initial nodes will be both cluster quorum members and data "workhorse" nodes.
# I will # separate duties as I scale. Also I set the min master nodes to 1 so that 
# my ES cluster comes up while initially testing a single index 
elasticsearch_node_master: "true"
elasticsearch_node_data: "true"
elasticsearch_discovery_zen_minimum_master_nodes: 1

We’ll see how we use these variables as we cover more features. Next up I used some nice features like shell and also register variables to be able to provide conditional behaviour for package install :

- name: check for previous elasticsearch installation
 shell: if [ -e /usr/share/elasticsearch/lib/elasticsearch-{{ elasticsearch_version }}.jar ]; then echo yes; else echo no; fi;
 register: version_exists
 always_run: True
 tags: elastic

- name: uninstalling previous version if applicable
 become: true
 command: yum erase -y elasticsearch
 when: version_exists.stdout == 'no'
 ignore_errors: true
 tags: elastic

and similarly for the marvel plugin :

- name: check marvel plugin installed
 become: true
 stat: path={{ elasticsearch_home_dir }}/plugins/marvel
 register: marvel_installed
 tags: elastic

- name: install marvel plugin
 become: true
 command: "{{ elasticsearch_home_dir }}/bin/plugin -i elasticsearch/marvel/latest"
 - restart elasticsearch
 when: not marvel_installed.stat.exists
 tags: elastic

The Marvel plugin stanza above also makes use of the stat module – this is a really great module. It returns all kinds of goodness you would normally expect from a stat() system call and yet you are doing it in your Ansible playbook.  There are a couple more things I will cover and then leave the rest for when I talk about Kibana and Logstash in a follow up post. First up then are templates. Ansible uses Jinja2 templating in order to transform a file and install it on your host, you can create a file with appropriate templating as below. The variables in {{ .. }} are from the roles ../var directory containing the yaml file already described earlier.

Note : I stripped all comment lines for sake of brevity:

[ansible@ansible-host01 templates]$ pwd
[ansible@ansible-host01 templates]$ grep -v ^# elasticsearch.yml.j2
{% if elasticsearch_cluster_name is defined %}
cluster.name: {{ elasticsearch_cluster_name }}
{% endif %}


{% if elasticsearch_node_name is defined %}
node.name: {{ elasticsearch_node_name }}
{% endif %}


{% if elasticsearch_node_master is defined %}
node.master: {{ elasticsearch_node_master }}
{% endif %}
{% if elasticsearch_node_data is defined %}
node.data: {{ elasticsearch_node_data }}
{% endif %}


{% if elasticsearch_memory_bootstrap_mlockall is defined %}
bootstrap.mlockall: {{ elasticsearch_memory_bootstrap_mlockall }}
{% endif %}

The template  file when run in the play is then transformed using the provided variables and copied into place on my  ELK host target VM…

- name: copy elasticsearch defaults file
 become: true
 template: src=elasticsearch.default.j2 dest=/etc/sysconfig/elasticsearch owner={{ elasticsearch_user }} group={{ elasticsearch_group }} mode=0644
 - restart elasticsearch
 tags: elastic

So let’s see how our playbook runs and what the output looks like

[ansible@ansible-host01 elk]$ ansible-playbook -i ./production site.yml \
--tags "config,elastic" --ask-sudo-pass
SUDO password:

PLAY [elastic-hosts] **********************************************************

GATHERING FACTS ***************************************************************
ok: []

TASK: [config | install java 8 runtime] ***************************************
ok: []

TASK: [config | set swappiness in sysctl/systemd] *****************************
ok: []

TASK: [config | set maximum map count in sysctl/systemd] **********************
ok: []

TASK: [config | set hard limits for open files] *******************************
ok: []

TASK: [config | set soft limits for open files] *******************************
ok: []

TASK: [config | set max locked memory] ****************************************
ok: []

TASK: [config | Install wget package (Fedora based)] **************************
ok: []

TASK: [elastic | install elasticsearch signing key] ***************************
changed: []

TASK: [elastic | copy elasticsearch repo] *************************************
ok: []

TASK: [elastic | check for previous elasticsearch installation] ***************
changed: []

TASK: [elastic | uninstalling previous version if applicable] *****************
skipping: []

TASK: [elastic | install elasticsearch pkgs] **********************************
skipping: []

TASK: [elastic | copy elasticsearch configuration file] ***********************
ok: []

TASK: [elastic | copy elasticsearch defaults file] ****************************
ok: []

TASK: [elastic | set max memory limit in systemd file (RHEL/CentOS 7+)] *******
changed: []

TASK: [elastic | set log directory permissions] *******************************
ok: []

TASK: [elastic | set data directory permissions] ******************************
ok: []

TASK: [elastic | ensure elasticsearch running and enabled] ********************
ok: []

TASK: [elastic | check marvel plugin installed] *******************************
ok: []

TASK: [elastic | install marvel plugin] ***************************************
skipping: []

NOTIFIED: [elastic | restart elasticsearch] ***********************************
changed: []

PLAY RECAP ******************************************************************** : ok=19 changed=4 unreachable=0 failed=0

[ansible@ansible-host01 elk]$

I can verify that my ES cluster is working by querying the Cluster API – note that the red status is down to the fact I have no other cluster nodes yet on which to replicate the index shards:

# curl -XGET http://localhost:9200/_cluster/health?pretty
 "cluster_name" : "nx-elastic",
 "status" : "red",
 "timed_out" : false,
 "number_of_nodes" : 1,
 "number_of_data_nodes" : 1,
 "active_primary_shards" : 0,
 "active_shards" : 0,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 0,
 "delayed_unassigned_shards" : 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch" : 0

You can use further API queries to verify that the desired configuration is in place and at that point you have a solid, repeatable deployment with a known outcome ie: you are doing DevOps.