As Redis writes to disk asynchronously, it is not sufficient to monitor whether it is still accepting TCP/IP connections - you must also check that it is managing to persist the in-memory data to disk.
However, as it is not currently possible to deterministically monitor the status of these background saves, we are limited to checking that the number of seconds since the last save has not exceeded some threshold. This is possible using last_save_time, the Unix timestamp of the last successful save, which is available via the INFO command:
$ redis-cli info | grep last_save_time
last_save_time:1312045364
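For illustration, the "seconds since the last save" figure can be derived from that output in a few lines of Python. This is only a sketch: it assumes the raw INFO text is already available as a string, and parse_info and seconds_since_last_save are hypothetical helper names, not part of Redis or any client library.

```python
import time


def parse_info(raw):
    """Parse the 'key:value' lines of raw INFO output into a dict of strings."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and any '# Section' comment lines.
        if not line or line.startswith('#'):
            continue
        key, sep, value = line.partition(':')
        if sep:
            fields[key] = value
    return fields


def seconds_since_last_save(raw, now=None):
    """Return the age of the last save in seconds, relative to `now`."""
    if now is None:
        now = time.time()
    # last_save_time is a Unix timestamp, so a plain subtraction is enough.
    return now - int(parse_info(raw)['last_save_time'])
```

For the output shown above, seconds_since_last_save("last_save_time:1312045364", now=1312045424) gives 60.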
Checking for this has already saved me twice: once when my vm.overcommit_memory = 1 change was accidentally rolled back (see "Background saving is failing with a fork() error" in the Redis FAQ), and again when the machine simply did not have enough disk space.
To monitor this with Nagios:
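#!/usr/bin/env python
import sys
import time

import redis


def main(host, warning, critical):
    try:
        info = redis.Redis(host=host).info()
        # Older Redis releases call this field last_save_time; newer ones
        # report it as rdb_last_save_time, so accept either.
        key = 'rdb_last_save_time' if 'rdb_last_save_time' in info else 'last_save_time'
        last_save = time.time() - info[key]
    except Exception as e:
        # Any failure, including being unable to connect, is critical.
        print("CRITICAL: %s" % e)
        return 2

    # Start from OK and escalate through each threshold that is exceeded.
    ret = 0
    state = 'OK'
    for limit, state_, ret_ in (
        (int(warning), 'WARNING', 1),
        (int(critical), 'CRITICAL', 2),
    ):
        if last_save > limit:
            ret = ret_
            state = state_

    print("%s: Last dump was %d seconds ago" % (state, last_save))
    return ret


if __name__ == '__main__':
    sys.exit(main(*sys.argv[1:]))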
The following invocation will trigger a warning when we haven't saved in an hour and become critical if we haven't saved within 24 hours. (If we cannot connect to the server we are immediately critical, so we avoid an extra check just for this.)
$ check_redis 127.0.0.1 3600 86400
This setup works well, although we are still monitoring a heuristic rather than the actual outcome of each save. To remedy this, Redis could expose background save failures in the INFO output (last_bgsave_status=fail, perhaps?). However, the treatment of previous contributions puts me off sending further patches.
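If INFO did expose such a field, the check would no longer need time thresholds at all. A minimal sketch, assuming a hypothetical field named last_bgsave_status (the name suggested above, which the Redis described here does not provide) that holds "ok" after a successful save; bgsave_check is likewise a made-up helper name:

```python
def bgsave_check(info, field='last_bgsave_status'):
    """Map a hypothetical background-save status field to a Nagios result.

    `info` is a dict of INFO fields. Returns (exit_code, message) using the
    usual Nagios codes: 0 OK, 2 CRITICAL, 3 UNKNOWN.
    """
    status = info.get(field)
    if status is None:
        # The server does not report the field, so we cannot tell either way.
        return 3, 'UNKNOWN: %s not present in INFO output' % field
    if status == 'ok':
        return 0, 'OK: last background save succeeded'
    return 2, 'CRITICAL: last background save reported %r' % status
```

A plugin built on this would print the message and exit with the code, in the same way as the script above.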