
Sharded MongoDB config on Nutanix (2) : High Availability

One of the prime availability considerations for any horizontal scale-out application, like a MongoDB cluster, is how that cluster behaves under a failure event. We have seen (in the MongoDB case) how replica sets are configured with additional secondary instances to handle the failure of a primary instance in a replica set. We also create a mini quorum of configuration database servers and query routers to give redundancy to the cluster “infrastructure”. However, the Nutanix XCP environment provides further protection through certain features of the Acropolis management interface. Your key VMs need to be enabled to run under high availability, so that when the underlying hypervisor host fails for any reason, these VMs fail over to another host with sufficient CPU and RAM resources. The screenshot below shows how this (Tech Preview) feature can be enabled (pre NOS 4.5) on a per-VM basis:

enable-HA

The underlying migration functionality is also used for the manual placement of key VMs. As an example, let’s consider the following layout, where two of the configdb VMs in a MongoDB cluster are co-located on the same AHV host:

mongodb-colocated-vms

Notice in the screen capture above that there are two configdb VMs on host “D”, which means that ideally we want to migrate one of them to another AHV host. Let’s move the VM mongo-configdb02 to AHV host “C”…

mongodb-migrate-VM

Note that the migration process could have automatically chosen an appropriate AHV host to receive the VM. In the above case however, we have instead specified the desired host ourselves.

We can monitor the progress and duration of any migration via the VM tasks frame in Prism:

mongodb-vm-tasks-migration

As always, this workflow can also be done manually (or scripted) through the acli interface. In this example I am migrating the VM running a query router (mongos process):

<acropolis> vm.migrate mongos01 host=10.68.64.41 
mongos01: complete 
<acropolis>
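
To confirm the new placement from the CLI, the VM details can be queried and cross-referenced against the host list. The exact output format varies by NOS version, so treat the following as a sketch rather than a definitive transcript:

<acropolis> vm.get mongos01
<acropolis> host.list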

At the time of writing this post, Acropolis Base Software (NOS) 4.5 has been released and this feature is now generally available (GA). It can now be enabled cluster-wide:

ha-enable-menu-4

enable-ha-4

Nutanix customers are strongly encouraged to enable this feature when they require HA functionality for their VMs.

In my next post I will complete this short blog series on sharded MongoDB configs on Nutanix. I intend to cover how Nutanix Acropolis managed snapshots and clones can be employed to create backups, and how those backups can then be used to rapidly build out dev/QA type environments. Stay tuned.


Sharded MongoDB config on Nutanix (1) : Deployment

So far I have posted on MongoDB deployments either standalone or as part of a replica set. This is fine when you can size your VM memory to hold the entire database working set. However, if your VM’s RAM will not accommodate the working set, you will need to shard in order to aggregate RAM from multiple replica sets and form a MongoDB cluster.

Having already discussed using clones of gold image VMs to create the members of a replica set, note that the most basic MongoDB cluster requires at least two replica sets. On top of these we need a number of MongoDB “infrastructure” VMs that make cluster operation possible: a minimum of three (3) configuration databases (mongod --configsvr) per cluster and around one (1) query router (mongos) for every two shards. Here is the layout of a cluster deployment on my lab system:

2shard-system

In the above lab deployment, for availability reasons, I avoid co-locating any primary replica VM on the same physical host, and likewise any of the query router or configdb VMs. One thing to bear in mind is that sharding is done on a per-collection basis. Simply put, the idea behind sharding is that you split a collection across the replica sets, and by connecting to a mongos process you are routed to the appropriate shard holding the part of the collection that can serve your query. The following commands show the syntax to create one of the three required configdbs (run on three separate VMs; they need to be started first) and a query router, or mongos process (where we add the IP addresses of each configdb server VM):

Config DB Servers – each run as:
mongod --configsvr --dbpath /data/configdb --port 27019

Query Router - run as:
mongos --configdb 10.68.64.142:27019,10.68.64.143:27019,10.68.64.145:27019

The IP addresses in the mongos command line above are the addresses of each config DB server.
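
For reference, each shard member itself is just an ordinary replicated mongod process, started as described in the replica set post. A minimal sketch (the dbpath and port shown here are illustrative; in this environment they come from /etc/mongod.conf):

Shard replica set members - each run as:
mongod --replSet rs01 --dbpath /data --port 27017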

This brings up an issue if you are not cloning replica VMs from “blank” gold VMs. If you clone a new replica set from a currently working replica set (so that each replica set essentially holds a full copy of all your databases and their collections), then when you come to add that replica set as a shard you generate the error condition shown below.

Here’s an example of what can happen when you attempt to shard and your new replica set (rs02) is simply cloned off a currently running replica set (rs01):

mongos> sh.addShard("rs02/192.168.1.52")
{
 "ok" : 0,
 "errmsg" : "can't add shard rs02/192.168.1.52:27017 because a local database 'ycsb' 
exists in another rs01:rs01/192.168.1.27:27017,192.168.1.32:27017,192.168.1.65:27017"
}

This is the successful workflow for adding both shards (specifying the primary of each replica set) via the mongos router VM:

$ mongo --host localhost --port 27017
MongoDB shell version: 3.0.3
connecting to: localhost:27017/test
mongos>
 
mongos> sh.addShard("rs01/10.68.64.111")
{ "shardAdded" : "rs01", "ok" : 1 }
mongos> sh.addShard("rs02/10.68.64.110")
{ "shardAdded" : "rs02", "ok" : 1 }

We next need to enable sharding on the database and subsequently shard the collection we want to distribute across the available replica sets. The choice of shard key is crucial to future MongoDB cluster performance; issues such as read and write scaling, cardinality, etc. are covered here. For my test cluster I am using the _id field for demonstration purposes.

mongos> sh.enableSharding("ycsb")
{ "ok" : 1 }

mongos> sh.shardCollection("ycsb.usertable", { "_id": 1})
{ "collectionsharded" : "ycsb.usertable", "ok" : 1 }
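
If the monotonically increasing _id key proved to be a write-scaling bottleneck, one alternative worth considering is a hashed shard key. The command below is purely an illustration, not the configuration used for this cluster:

mongos> sh.shardCollection("ycsb.usertable", { "_id": "hashed" })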

The balancer process will run for the period of time needed to migrate data between the available shards. This can take anywhere from several hours to several days, depending on the size of the collection, the number of shards, the current workload, and so on. Once complete, this results in the following sharding status output. Notice that the “chunks” of the usertable collection held in the ycsb database are now spread across both shards (522 chunks in each shard):

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("55f96e6c5dfc4a5c6490bea3")
}
  shards:
    {  "_id" : "rs01",  "host" : "rs01/10.68.64.111:27017,10.68.64.131:27017,10.68.64.144:27017" }
    {  "_id" : "rs02",  "host" : "rs02/10.68.64.110:27017,10.68.64.114:27017,10.68.64.137:27017" }
  balancer:
    Currently enabled:  yes
    Currently running:  no
    Failed balancer rounds in last 5 attempts:  0
    Migration Results for the last 24 hours:
      No recent migrations
  databases:
    {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
    {  "_id" : "enron_mail",  "partitioned" : false,  "primary" : "rs01" }
    {  "_id" : "mydocs",  "partitioned" : false,  "primary" : "rs01" }
    {  "_id" : "sbtest",  "partitioned" : false,  "primary" : "rs01" }
    {  "_id" : "ycsb",  "partitioned" : true,  "primary" : "rs01" }
      ycsb.usertable
        shard key: { "_id" : 1 }
        chunks:
          rs01  522
          rs02  522
        too many chunks to print, use verbose if you want to force print
    {  "_id" : "test",  "partitioned" : false,  "primary" : "rs02" }
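
The same balancer information shown in the status output above can also be queried directly from a mongos shell; the boolean results below are what you would expect once balancing has finished:

mongos> sh.getBalancerState()
true
mongos> sh.isBalancerRunning()
false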


Using Nutanix snapshots to backup MongoDB

In an earlier post I described how Nutanix clones can speed up deployment workflows for MongoDB replica sets. In this post we will cover the Nutanix VM-centric snapshot functionality used to back up a MongoDB database instance. Even when the database is sharded, the ability to take a snapshot of a set of VMs at once (as part of a consistency group) ensures that a point-in-time (PIT) backup can easily be taken.

Let’s first consider the 3-member replica set described in my last post. In order to take an OS-level backup we need to quiesce I/O to one of the secondary members of the replica set. The command sequence below is for the MMAPv1 storage engine only. The db.fsyncLock() call flushes all pending writes and locks the instance against further writes until we run db.fsyncUnlock(). Even though we have journaling enabled, we have separated the I/O for data and journal files onto different volumes (as part of our standard config), hence the need to quiesce I/O to make sure we take a consistent snapshot.

rs01:SECONDARY> db.fsyncLock()
{
 "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
 "seeAlso" : "http://dochub.mongodb.org/core/fsynccommand",
 "ok" : 1
}
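
As a quick sanity check before taking the snapshot, the lock status can be confirmed from the same shell session; this field is reported by db.currentOp() while the lock is held:

rs01:SECONDARY> db.currentOp().fsyncLock
true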

Nutanix snapshots work at the VM level. This means we take a snapshot of all the vDisks in the VM hosting a MongoDB secondary at the same time. However, MongoDB can’t guarantee backup consistency if the journal and data are located in separate volumes. Hence, we still need db.fsyncLock()/db.fsyncUnlock().

<acropolis> vm.snapshot_create mongodb03
SnapshotCreate: complete

<acropolis> vm.snapshot_create mongodb03 snapshot_name_list=mongodb03-snap1
SnapshotCreate: complete

<acropolis> vm.snapshot_list mongodb03
Snapshot name Snapshot UUID
mongodb03-snap1 41c77ddc-8cc7-49bf-a250-23d52031b76e
mongodb03_2015-09-08T10:04:00.805780 7cb77f7f-1eb6-4103-8909-e8e50c318151

Above we have taken two snapshots (by way of example) and show the naming conventions returned for both a named snapshot and the default. See the rest of the workflow below…

<acropolis> vm.clone mongodb04 clone_from_snapshot=mongodb03-snap1
mongodb04: complete

<acropolis> vm.restore mongodb03 mongodb03-snap1
VmRestore: complete

<acropolis> snapshot.delete mongodb03-snap1
Delete 1 snapshots? (yes/no) y
Please type 'yes' or 'no': yes
mongodb03-snap1: complete

We can carry out the same workflow via Nutanix Prism. I have taken a screenshot of the available commands (see below) in the VM Snapshots frame. The command sequence is issued via the VM tab in the GUI, and the desired arguments, such as clone and snapshot names, are entered in the resulting popups. For brevity I have not included all the required screenshots.

vm-snapshot-actions

Having taken our snapshot, we can now unlock the replica. Make sure you have kept the session in which you called db.fsyncLock() open until this point.

rs01:SECONDARY> db.fsyncUnlock()
{ "ok" : 1, "info" : "unlock completed" }
rs01:SECONDARY>

Any clones newly created from the Nutanix VM snapshot can be used to seed a test environment with production-strength data. Because the backup/clone was made from a VM that formed part of a replica set, we need to be able to start the MongoDB instance in standalone mode. You can do this as follows:

  • Power off the newly created clone (mongodb04)
  • Log in and create a sub-directory in the MongoDB data directory into which to move the local db files:
[root@mongodb04 data]# mkdir local-old
[root@mongodb04 data]# mv local.* local-old/
  • Edit /etc/mongod.conf: comment out the line naming the replica set and make sure bind_ip reflects the IP address assigned to the VM (via DHCP in my case)
[mongod@mongodb04 ~]$ ip a | grep eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
 inet 10.68.64.138/24 brd 10.68.64.255 scope global eth0

[mongod@mongodb04 ~]$ egrep "bind_ip|replSet" /etc/mongod.conf
bind_ip=127.0.0.1,10.68.64.138
#replSet=rs01
  • Start up the mongod instance (now standalone) with the newly edited configuration file
  • Drop into a mongo shell. You can then verify the consistency of the copied data (in my example, a synthetic database created by the YCSB benchmark software):
[mongod@mongodb04 ~]$ mongo
MongoDB shell version: 3.0.3
connecting to: test
>
> show dbs
enron_mail 3.952GB
local 0.078GB
ycsb 207.853GB
>
> use ycsb
switched to db ycsb
>
> show collections
system.indexes
usertable
>
> db.usertable.findOne()
{
 "_id" : "user6284781860667377211",
 "field5" : BinData(0,"PTo4JCo4JiskPDU2PiIzPikmMycrNSQjJzo3NiEqJjImIyY0PSE5KS43ICM4Liw+MCclLyE1JCA3PDM3OCsoMz0nKD8rISYrKCw8Pik4JT0jIyYuKDkoOy4gJCkoPSMkOywnNA=="),
 "field4" : BinData(0,"PyI/NSIqITEhKzQxPTwlNyIzJy0zLz05IzUvLDkwLi0uNjQwKysnMSQ5Lj8hMDozPDo+JTg/IDw6NTI9PzAkNjIjLTw8OyYvOTsgLTM2IjcvPCk4Ij84Ny45LiYhJC0mKT0qKw=="),
 "field3" : BinData(0,"MTkwLj0sPTYtPTg7OTUmIz0xJSM3OzM+Kyw3PyckPDYnMyotMC84JDIjMzEnITktJi4mMz0rLTkzODkoPCUkMT4pJzEpKiM8JDUlKjUxLiY5PzIxIiM0OCgpLTUpLyQ1JCogLw=="),
 "field2" : BinData(0,"Pz4jNTcnLDgwKS04ISc7Oi01IDkrNDclPTsuMyElOSU/KywvMzY6PDU0Jyo3NyYjKig8NzMrOTs7LiQiMy0hKioiPiMkPTgyLDAyPSoxJS4tOiUyLzE9Pz8rJy0zPjo2PCszNA=="),
 "field9" : BinData(0,"KTknMSs2PyEiMjUiPz4qKiM9MS0iIj4vMyMyNCsjPSUkJyYsJz0nLD0/NSY3MyQ7KCo/OikiNyU3MjQ8MCArLj8wMyM0MiAgJiUzOCc5KjYiJTw8Izw+OzA7NCAiPzA5ISYxOg=="),
 "field8" : BinData(0,"KCgoLTQ4ICE0IDY0ODAyIS86LiYrKzQmIyIpJy0nKz0gICYpMTQsLT86PyAiKjs/Kzw3MSciJDYkJTssLiMsLiUtLzMtOiYpND8nMCQpOzopMyQ1OyYsNy8uPysxNyg1PCQ6MA=="),
 "field7" : BinData(0,"ITM1KiUmKCcnKjA8IT8xPCIjKzc4Ijs+IyYhICMlKCsjPy09MygxKTohPDgjLjIzMzwpPj45Oyg2JTgsLjYlJS0xPCc+IiolKDw0JiQkLyg4NiohID0xPjIyMic9KDgqKD0zKg=="),
 "field6" : BinData(0,"IjovMTotODUxPD0mPikqLyo4IzonKjc+LDwuLCEhICghOzYxJTw5PiAzIzktNjA9OzUqNiovMTszMDMqLDQmNjk8IDA0NjAvPz8mPygnITU+OSY8JCIzLDAgOSEmNyA1KCQ+Jw=="),
 "field1" : BinData(0,"ID0sMjg8JiAiKz4gITYjNCkxLSs+IC8/JTAiNCA8PignMTEjNyogKyU0Lz4lNz0mIzohNCQ8LSwhICw/OT82PCYyPjk2KSElIDo5MSU2Kig9NiUnOjwpLi0vOjIqJz0sKDExOw=="),
 "field0" : BinData(0,"OikgJykuOTchIjUlISghLCkwOis/Pyw4ISo7Jyw6Kz46NTwlODIsKDQqLT0uNi0rOTcsOzA4Lzo3PDcyMjozLiItKSY1JSsoLSIkPio7KC4xOzs7MzY7Oj0nOzU+IyEuJDMwJA==")
}

The VM can now be easily migrated anywhere across the Nutanix cluster using the features of the Nutanix Distributed Filesystem (NDFS). If additional availability is required, we can clone “blank” gold image VMs and form a new replica set. To avoid confusion, the naming convention of the new replica set(s) can reflect the intended use case: dev, test, QA, etc.

Using Nutanix clones to deploy MongoDB replica set

In this post I am going to look at setting up a replica set to support high availability in a MongoDB environment. Replica sets contain a primary MongoDB database and a number of additional secondary replicas. Any one of the eligible replicas can become primary in the event that the original primary fails for whatever reason. Replica sets usually have an odd number of members so that primary elections cannot end in a tie.

Building out an HA MongoDB setup on Nutanix is relatively easy to do. Each MongoDB instance is hosted in a separate, sandboxed environment, in our case a virtual machine (VM). Each VM is then located on a separate physical hypervisor host. I have a gold image VM with a MongoDB instance installed along recommended best-practice guidelines. This VM gets cloned as required when I need to build out a new MongoDB environment, so for a 3-member replica set I need 3 clones.

three-replicaset

From a cluster CVM node, type:

$ acli 

<acropolis> vm.clone mongodb01,mongodb02,mongodb03 clone_from_vm=mongodb30-gold
mongodb01: complete
mongodb02: complete
mongodb03: complete
 
<acropolis> vm.list
...
mongodb01: 2b9498c1-502e-454e-93c8-931a45a321b6
mongodb02: 9a445d26-caf9-4ddf-9d8e-296ea8b6e19e
mongodb03: 9a5512fa-3d19-4ddc-8cac-11721f999459
...

<acropolis> vm.on mongodb01,mongodb02,mongodb03
mongodb01: complete
mongodb02: complete
mongodb03: complete

After powering on the VMs, check that mongod starts correctly on the default port 27017 on each VM. The first thing to make sure of is that the mongod process is listening on the correct address. I have set my VMs to use DHCP, and this is the address that the service needs to listen on.

# ip a

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
 link/ether 52:54:00:db:17:76 brd ff:ff:ff:ff:ff:ff
 inet 10.68.64.111/24 brd 10.68.64.255 scope global eth0


# cat /etc/mongod.conf | grep -i bind_ip
 bind_ip=127.0.0.1,10.68.64.111

# service mongod restart
# service mongod status

Once all of the VMs are up and running on their respective address:port tuples, make sure firewall access is enabled via iptables. Each VM that will form part of the replica set needs to allow access from the other members on the mongod port 27017. So for a replica set with members 10.68.64.111, 10.68.64.114 and 10.68.64.113, run the following on each member (10.68.64.111 in this example):

# iptables -A INPUT -s 10.68.64.113 -p tcp --destination-port 27017 -m state \
--state NEW,ESTABLISHED -j ACCEPT
# iptables -A INPUT -s 10.68.64.114 -p tcp --destination-port 27017 -m state \
--state NEW,ESTABLISHED -j ACCEPT

# service iptables save
iptables: Saving firewall rules to /etc/sysconfig/iptables:[ OK ]
# service iptables reload

Abridged iptables -L output after the above changes:

Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT tcp -- 10.68.64.113 anywhere tcp dpt:27017 state NEW,ESTABLISHED
ACCEPT tcp -- 10.68.64.114 anywhere tcp dpt:27017 state NEW,ESTABLISHED

Check access by performing a series of bi-directional tests between all the replica set members:

<10.68.64.111>$ mongo --host 10.68.64.113 --port 27017
MongoDB shell version: 3.0.3
connecting to: 10.68.64.113:27017/test
>
> quit()

Should any of the connection tests fail, revisit the iptables entries. The usual troubleshooting applies: telnet or nc, netstat, etc.
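
For instance, a quick port probe from one member to another and a listener check on the target VM might look like this (exact tools and output depend on what is installed in the guest OS):

<10.68.64.111>$ nc -zv 10.68.64.113 27017
<10.68.64.113>$ netstat -plnt | grep 27017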

In order to create the replica set, connect via ssh to each VM and edit the mongod.conf to include the replSet functionality:

$ grep -i replSet /etc/mongod.conf
replSet=rs01

Restart the mongod process (sudo service mongod restart) and then start a mongo shell session. On the first member of the set (the intended primary), run:

$ mongo
MongoDB shell version: 3.0.3
connecting to: test
> rs.initiate()
{
 "info2" : "no configuration explicitly specified -- making one",
 "me" : "10.68.64.111:27017",
 "ok" : 1
}
rs01:PRIMARY>

You can use the shell commands rs.conf() and rs.status() to check the replica set at any point. We’ll look at one of these outputs after completing the replica set creation. Next, from the same mongo shell session, add the other two replica nodes:

rs01:PRIMARY> rs.add("10.68.64.113")
{ "ok" : 1 }

rs01:PRIMARY> rs.add("10.68.64.114")
{ "ok" : 1 }

Potential error scenarios

  •  if you didn’t clone the VMs for the replica set from a blank gold image, but rather from a VM already running a replicated MongoDB configuration, then the replication commands report errors similar to this:
{
 "info2" : "no configuration explicitly specified -- making one",
 "me" : "10.68.64.111:27017",
 "info" : "try querying local.system.replset to see current configuration",
 "ok" : 0,
 "errmsg" : "already initialized",
 "code" : 23
}

Provided that this is a greenfield install, delete the local db files in the data directory and re-run rs.initiate(), as sketched below.
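
A minimal sketch of that clean-up, assuming the data directory is /data (as used elsewhere in these posts) and a service-managed mongod:

$ sudo service mongod stop
$ cd /data
$ mkdir local-old && mv local.* local-old/
$ sudo service mongod start
$ mongo --eval "rs.initiate()"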

  • if the firewall rules are not set correctly then the following error message is thrown:
 "errmsg" : "Quorum check failed because not enough voting nodes responded; 
required 2 but only the following 1 voting nodes responded: 10.68.64.111:27017; 
the following nodes did not respond affirmatively: 
10.68.64.131:27017 failed with Failed attempt to connect to 10.68.64.131:27017; 
couldn't connect to server 10.68.64.131:27017 (10.68.64.131), 
connection attempt failed",

Ensure that the firewall rules allow proper access between the VMs.

  • if replication is not enabled correctly in the mongod configuration files on each host of the replica set:
"errmsg" : "Quorum check failed because not enough voting nodes responded; 
required 2 but only the following 1 voting nodes responded: 10.68.64.110:27017; 
the following nodes did not respond affirmatively: 
10.68.64.114:27017 failed with not running with --replSet",

Once the replica set configuration is complete, check the setup by running rs.status() or rs.conf() to confirm:

rs01:PRIMARY> rs.conf()
{
    "_id" : "rs01",
    "version" : 3,
    "members" : [
        {
            "_id" : 0,
            "host" : "10.68.64.111:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : 0,
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "10.68.64.113:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : 0,
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "10.68.64.114:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : 0,
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatTimeoutSecs" : 10,
        "getLastErrorModes" : {

        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        }
    }
}

From the output above we can see the full replica set membership, including each member’s role and settings: its priority, whether the replica is hidden from application queries, whether it is a full mongod instance or an arbiter (there simply to prevent tied primary elections), and whether any replica has a delay enabled (used for backup/reporting duties).

In an earlier post I showed the mongo shell commands available to calculate the working set for the database. For read-intensive workloads, where your working set is sized to fit the available RAM in the mongod server VMs, a replica set deployment can be used to run MongoDB and support high availability.
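
As a rough starting point (not the full working-set calculation covered in that earlier post), the mongo shell can report resident memory usage and per-database sizes; a minimal sketch:

> db.serverStatus().mem          // resident, virtual and mapped memory for this mongod
> db.stats(1024*1024*1024)       // database statistics scaled to GB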