r.va.gg

Testing code against many Node versions with Docker

I hadn't found a reason to play with Docker until now, but I've finally come up with an excellent use-case.

NAN is a project that helps build native Node.js add-ons while maintaining compatibility with Node and V8 from Node versions 0.8 onwards. V8 is currently undergoing major internal changes which is making add-on development very difficult; NAN's purpose is to abstract that pain. Instead of having to manage the difficulties of keeping your code compatible across Node/V8 versions, NAN does it for you. But this means that we have to be sure to keep NAN tested and compatible with all of the versions it claims to support.

Travis can help a little with this. It's possible to use nvm to test across different versions of Node; we've tried this with NAN (see the .travis.yml). Ideally you'd have a better choice of Node versions, but Travis has had some difficulty keeping up. Also, npm bugs make it difficult, with a high failure rate from npm install problems, like this and this, so I don't even publish the badge on the NAN README.

The other problem with Travis is that it's a CI solution, not a proper testing solution. Even if it worked well, it's not really that helpful in the development process: you need rapid feedback that your code is working on your target platforms (this is one reason why I love back-end development more than front-end development!)

Enter Docker and DNT

DNT: Docker Node Tester

Docker is a tool that simplifies the use of Linux containers to create lightweight, isolated compute "instances". Solaris and its variants have had this functionality for years in the form of "zones" but it's a fairly new concept for Linux and Docker makes the whole process a lot more friendly.

DNT contains two tools that work with Docker and Node.js to set-up containers for testing and run your project's tests in those containers.

DNT includes a setup-dnt script that sets up the most basic Docker images required to run Node.js applications, nothing extra. It first creates an image called "dev_base" that uses the default Docker "ubuntu" image and adds the build tools required to compile and install Node.js.

Next it creates a "node_dev" image that contains a complete copy of the Node.js source repository. Finally, it creates the series of version-specific images you need: for each Node version, an image with that version installed and ready to use.

Setting up a project is a matter of creating a .dntrc file in the root directory of the project. This configuration file involves setting a NODE_VERSIONS variable with a list of all of the versions of Node you want to test against, and this can include "master" to test the latest code from the Node repository. You also set a TEST_CMD variable with a series of commands required to set up, compile and execute your tests. The setup-dnt command can be run against a .dntrc file to make sure that the appropriate Docker images are ready. The dnt command can then be used to execute the tests against all of the Node versions you specified.

Since Docker containers are completely isolated, DNT can run tests in parallel as long as the machine has the resources. The default is to use the number of cores on the computer as the concurrency level but this can be configured if not appropriate.

Currently DNT is designed to parse TAP test output by reading the final line as either "ok" or "not ok" to report test status back on the command-line. It is configurable but you need to supply a command that will transform test output to either an "ok" or "not ok" (sed to the rescue?).
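As a sketch of the kind of adapter that calls for (everything here is hypothetical; `run_tests` stands in for whatever your real runner is), a runner that finishes with "PASS" or "FAIL" can have its output translated on the way out of TEST_CMD:

```shell
# Hypothetical adapter: suppose your (non-TAP) test runner prints "PASS"
# or "FAIL" as its final line. A sed filter appended to TEST_CMD can
# rewrite that line into the "ok" / "not ok" that DNT looks for.
run_tests() {                 # stand-in for a real test runner
  echo "ran 12 tests"
  echo "PASS"
}
run_tests | sed -e 's/^PASS$/ok/' -e 's/^FAIL$/not ok/'
```

The final line of output is then "ok", which DNT reads as a passing run.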

How I'm using it

My primary use-case is for testing NAN. The test suite needs a lot of work so being able to test against all the different V8 and Node APIs while coding is super helpful; particularly when tests run so quickly! My NAN .dntrc file tests against master, all of the 0.11 releases since 0.11.4 (0.11.0 to 0.11.3 are explicitly not supported by NAN) and the last 5 releases of the 0.10 and 0.8 series. At the moment that's 17 versions of Node in all and on my computer the test suite takes approximately 20 seconds to complete across all of these releases.

The NAN .dntrc

NODE_VERSIONS="\
  master   \
  v0.11.9  \
  v0.11.8  \
  v0.11.7  \
  v0.11.6  \
  v0.11.5  \
  v0.11.4  \
  v0.10.22 \
  v0.10.21 \
  v0.10.20 \
  v0.10.19 \
  v0.10.18 \
  v0.8.26  \
  v0.8.25  \
  v0.8.24  \
  v0.8.23  \
  v0.8.22  \
"
OUTPUT_PREFIX="nan-"
TEST_CMD="\
  cd /dnt/test/ &&                                               \
  npm install &&                                                 \
  node_modules/.bin/node-gyp --nodedir /usr/src/node/ rebuild && \
  node_modules/.bin/tap js/*-test.js;                            \
"

Next I configured LevelDOWN for DNT. Its needs are much simpler: the tests just compile the add-on and run a lot of node-tap tests.

The LevelDOWN .dntrc

NODE_VERSIONS="\
  master   \
  v0.11.9  \
  v0.11.8  \
  v0.10.22 \
  v0.10.21 \
  v0.8.26  \
"
OUTPUT_PREFIX="leveldown-"
TEST_CMD="\
  cd /dnt/ &&                                                    \
  npm install &&                                                 \
  node_modules/.bin/node-gyp --nodedir /usr/src/node/ rebuild && \
  node_modules/.bin/tap test/*-test.js;                          \
"

Another native Node add-on that I've set up with DNT is my libssh bindings. This one is a little more complicated because you need to have some non-standard libraries installed before compiling. My .dntrc adds some extra apt-get sauce to fetch and install those packages. It means the tests take a little longer but it's not prohibitive. An alternative would be to configure the node_dev base-image to have these packages so that all of my versioned images have them too.

The node-libssh .dntrc

NODE_VERSIONS="master v0.11.9 v0.10.22"
OUTPUT_PREFIX="libssh-"
TEST_CMD="\
  apt-get install -y libkrb5-dev libssl-dev &&                           \
  cd /dnt/ &&                                                            \
  npm install &&                                                         \
  node_modules/.bin/node-gyp --nodedir /usr/src/node/ rebuild --debug && \
  node_modules/.bin/tap test/*-test.js --stderr;                         \
"

LevelUP isn't a native add-on but it does use LevelDOWN which requires compiling. For the DNT config I'm removing node_modules/leveldown/ prior to npm install so it gets rebuilt each time for each new version of Node.

The LevelUP .dntrc

NODE_VERSIONS="\
  master   \
  v0.11.9  \
  v0.11.8  \
  v0.10.22 \
  v0.10.21 \
  v0.8.26  \
"
OUTPUT_PREFIX="levelup-"
TEST_CMD="\
  cd /dnt/ &&                                                    \
  rm -rf node_modules/leveldown/ &&                              \
  npm install --nodedir=/usr/src/node &&                         \
  node_modules/.bin/tap test/*-test.js --stderr;                 \
"

What's next?

I have no idea but I'd love to have helpers flesh this out a little more. It's not hard to imagine this forming the basis of a local CI system as well as a general testing tool. The speed even makes it tempting to run the tests on every git commit, or perhaps on every save.
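For instance, a hypothetical git pre-commit hook (not something DNT ships with) could gate every commit on the full matrix:

```shell
# Hypothetical: gate every commit on the DNT matrix. Assumes `dnt` is on
# your PATH and the project has a .dntrc; run this from the repo root.
mkdir -p .git/hooks           # already present in a real repository
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# abort the commit if any Node version fails its tests
exec dnt
EOF
chmod +x .git/hooks/pre-commit
```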

If you'd like to contribute to development then please submit a pull request, I'd be happy to discuss anything you might think would improve this project. I'm keen to share ownership with anyone making significant contributions; as I do with most of my open source projects.

See the DNT GitHub repo for installation and detailed usage instructions.

LevelDOWN v0.10 / managing GC in native V8 programming

LevelDB

Today we released version 0.10 of LevelDOWN. LevelDOWN is the package that directly binds LevelDB into Node-land. It's mainly C++ and is a fairly raw & direct interface to LevelDB. LevelUP is the package that we recommend most people use for LevelDB in Node as it takes LevelDOWN and makes it much more Node-friendly, including the addition of those lovely ReadStreams.

Normally I wouldn't write a post about a minor release like this but this one seems significant because of a number of small changes that culminate in a relatively major release.

In this post:

  • V8 Persistent references
  • Persistent in LevelDOWN; some removed, some added
  • Leaks!
  • Snappy 1.1.1
  • Some embarrassing bugs
  • Domains
  • Summary
  • A final note on Node 0.11.9

V8 Persistent references

The main story of this release is v8::Persistent references. For the uninitiated, V8 internally has two different ways to track "handles", which are references to JavaScript objects and values currently active in a running program. There are Local references and there are Persistent references. Local references are the most common; they are the references you get when you create an object or pass it around within a function and do the normal work that you do with an object. Persistent references are a special case that is all about garbage collection. An object that has at least one active Persistent reference to it is not a candidate for garbage collection. Persistent references must be explicitly destroyed before they release the object and make it available to the garbage collector.

Prior to V8 3.2x.xx (I don't know the exact version, does it matter? It roughly corresponds to Node v0.11.3.), these handles were both as easy as each other to create and interchange. You could swap one for the other whenever you needed. My guess is that the V8 team decided that this was a little too easy and that a major cause of memory leaks in C++ V8 code was the ease with which you could swap a Local for a Persistent and then forget to destroy the Persistent. So they tweaked the "ease" equation and it's become quite difficult.

Persistent and Local no longer share the same type hierarchy and the way you instantiate and assign a Persistent has become quite awkward. You now have to go through enough gymnastics to create a Persistent that it makes you ask the question: "Do I really need this to be a Persistent?" Which I guess is a good thing for memory leaks. NAN to the rescue though! We've somewhat papered over those difficulties with the capabilities introduced in NAN, it's still not as easy as it once was but it's not a total headache.

So, you understand v8::Persistent now? Great, so back to LevelDOWN.

Persistent in LevelDOWN; some removed, some added!

Some removed

Recently, Matteo noticed that when you're performing a Batch() operation in LevelDB, there is an explicit copy of the data that you're feeding into that batch. When you construct a Batch operation in LevelDB you start off with a short string representing the batch and then build on that string as you build your batch with Put() and Del() operations. You end up with a long string containing all of your write data: keys and values. Then when you call Write() on the Batch, that string gets fed directly into the main LevelDB store as a single write, which is where the atomicity of Batch comes from.

Both the chained-form and array-form batch() operations work this way internally in LevelDOWN.

However, with almost all operations in LevelDOWN, we perform the actual writes and reads against LevelDB in libuv worker threads. So we have to create the "descriptor" for work in the main V8 Node thread and then hand that off to libuv to perform the work in a separate thread. Once the work is completed we get the results back in the main V8 Node thread from where we can trigger a callback. This is where Persistent references come in.

Before we hand off the work to libuv, we need to make Persistent references to any V8 object that we want to survive across the asynchronous operation. Obviously the main candidate for this is callback functions. Consider this code:

db.get('foo', function (err, value) {
  console.log('foo = %s', value)
})

What we've actually done is create an anonymous closure for our callback. It has nothing referencing it, so as far as V8 is concerned it's a candidate for garbage collection once the current thread of execution is completed. In Node, however, we're doing asynchronous work with it and need it to survive until we actually call it. This is where Persistent references come in. We receive the callback function as a Local in our C++ but then assign it to a Persistent so GC doesn't touch it. Once we've finished our async work we can call the function and destroy the Persistent, effectively turning it back into a Local and freeing it up for GC.

Without the Persistent, the behaviour is indeterminate. It depends on the version of V8, the GC settings, the workload currently in the program and the amount of time the async work takes to complete. If the GC is aggressive enough and has a chance to run before our async work is complete, the callback will disappear and we'll end up trying to call a function that no longer exists. This can obviously lead to runtime errors and will most likely crash our program.

In LevelDOWN, if you're passing in String objects for keys and values, we have to perform an explicit copy to pull out the data and turn it into a form that LevelDB can use. Once we've copied the data from the String we don't need to care about the original object and GC can get its hands on it as soon as it wants. So we can leave String objects as Local references while we are building the descriptor for our async work.

Buffer objects are a different matter altogether. Because we have access to the raw character array of a Buffer, we can feed that data straight into LevelDB, saving us one copy operation (which can be a significant performance boost if the data is large or you're doing lots of operations; prefer Buffers where convenient if you need higher performance). When building the descriptor for the async work, we are just passing a character array to the LevelDB data structures that we're setting up. Because the data is shared with the original Buffer, we have to make sure that GC doesn't clean up that Buffer before we have a chance to use the data. So we make a Persistent reference for it which we clean up after the async work is complete. So you can do this without worrying about GC:

db.put(
    new Buffer('foo')
  , require('crypto').randomBytes(1024)
  , function (err) {
      console.log('foo is now some random data!')
    }
)

This has been the case in LevelDOWN for all operations since pretty much the beginning. But back to Matteo's observation: if LevelDB's data structures perform an explicit copy of the data we feed them, then perhaps we don't need to keep the original data safe from GC? For a batch() call it turns out that we don't! When we're constructing the Batch descriptor, as we feed data into it with both Put() and Del(), it takes a copy of our data to create its internal representation. So even when we're using Buffer objects on the JavaScript side, we're done with them before the call down into LevelDOWN is completed, so there's no reason to save a Persistent reference! For other operations we're still doing some copying during the asynchronous cycle, but the removal of the overhead of creating and deleting Persistent references for batch() calls is fantastic news for those doing bulk data loading (like Max Ogden's dat project which needs to bulk load a lot of data).
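The copy-versus-share distinction is easy to see from JavaScript alone (a toy illustration using the modern Buffer.from API; nothing here is LevelDB-specific):

```javascript
// Toy illustration of copy vs. zero-copy semantics. When the C++ side
// copies the bytes (as LevelDB's Batch does), the original Buffer can be
// mutated or collected afterwards without harm; a shared (zero-copy)
// reference is only safe while the original is kept intact and alive.
var buf = Buffer.from('foo')
var copied = Buffer.from(buf) // explicit copy, like Batch's internal string
var shared = buf              // shared reference, like the zero-copy path

buf[0] = 'X'.charCodeAt(0)    // mutate the original after handing it over

console.log(copied.toString()) // "foo" -- the copy is unaffected
console.log(shared.toString()) // "Xoo" -- the shared view sees the change
```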

Some added

Another gem from Matteo was a report of crashes during certain batch() operations. Difficult to reproduce and only occurring under very particular circumstances, it seemed to be mostly triggered by the kinds of workloads generated by LevelGraph. Thanks to some simple C++ debugging we traced it to a dropped reference, obviously taken by GC. The code in question boiled down to something like this:

function doStuff () {
  var batch = db.batch()
  batch.put('foo', 'bar')
  batch.write(function (err) {
    console.log('done', err)
  })
}

In this code, the batch object is actually a LevelDOWN Batch object created in C++-land. During the write() operation, which is asynchronous, we end up with no hard references to batch in our code because the JS thread has yielded and moved on, and batch is contained within the scope of the doStuff() function. Because most of the asynchronous operations we perform are relatively quick, this normally doesn't matter. But for writes to LevelDB, if you have enough data in your write and enough data already in your data store, you can trigger a compaction upstream, which can delay the write, which can give V8's GC time to clean up references that might be important and for which you have no Persistent handles.

In this case, we weren't actually creating internal Persistent references for some of our objects: Batch in this case, but also Iterator. Normally this isn't a problem because to use these objects you generally keep references to them yourself in your own code.

We managed to debug Matteo's crash by adjusting his test code to look something like this and watching it succeed without a crash:

function doStuff () {
  var batch = db.batch()
  batch.put('foo', 'bar')
  batch.write(function (err) {
    console.log('done', err)
    batch.foo = 'bar'
  })
}

By reusing batch inside our callback function, we're creating some work that V8 can't optimise away and therefore has to assume isn't a noop. Because the batch variable is also now referenced by the callback function and we already have an internal Persistent for it, GC has to pass over batch until the Persistent is destroyed for the callback.

So the solution is simply to create a Persistent for the internal objects that need to survive across asynchronous operations and make no assumptions about how they'll be used in JavaScript-land. In our case we've gone for assigning a Persistent just prior to every asynchronous operation and destroying it after. The alternative would be to have a Persistent assigned upon the creation of objects we care about but sometimes we want GC to do its work:

function dontDoStuff () {
  var batch = db.batch()
  batch.put('foo', 'bar')
  // nothing else, wut?
}

I don't know why you would write that code but perhaps you have a use-case where you want the ability to start constructing a batch but then decide not to follow through with it. GC should be able to take care of your mess like it does with all of the other messes you create in your daily adventures with JavaScript.

So we are only assigning a Persistent when you do a write() with a chained-batch operation in LevelDOWN since it's the only asynchronous operation. So in dontDoStuff() GC will come along and rid us of batch, 'foo' and 'bar' when it has the next opportunity and our C++ code will have the appropriate destructors called that will clean up any other objects we have created along the way, like the internal LevelDB Batch with its copy of our data.

Leaks!

We've been having some trouble with leaks in LevelUP/LevelDOWN lately (LevelDOWN/#171, LevelGraph/#40). It turns out that these leaks aren't related to Persistent references, which shouldn't be a surprise since it's so easy to leak in non-GC code, particularly if you spend most of your day programming in a language with GC.

With the help of Valgrind we tracked the leak down to the omission of a delete in the destructor of the asynchronous work descriptor for array-batch operations. The internal LevelDB representation of a Batch wasn't being cleaned up unless you were using the chained-form of LevelDOWN's batch(). This one has been dogging us for a few releases now and it's been a headache particularly for people doing bulk-loading of data so I hope we can finally put it behind us!

Snappy 1.1.1

Google released a new version of Snappy, version 1.1.1. I don't really understand how Google uses semver; we get very simple LevelDB releases with the minor version bumped and then we get versions of Snappy released with non-trivial changes with only the patch version bumped. I suspect that Google doesn't know how it uses semver either and there's no internal policy on it.

Anyway, Snappy 1.1.1 has some fixes, some minor speed and compression improvements but most importantly it breaks compilation on Windows. So we had to figure out how to fix that for this release. Ugh. I also took the opportunity to clean up some of the compilation options for Snappy and we may see some improvements in the way it works now... perhaps.

Some embarrassing bugs

Amine Mouafik is new to the LevelDOWN repository but has picked up some rather embarrassing bugs/omissions that are probably my fault. It's great to have more eyes on the C++ code; there aren't enough JavaScript programmers with the confidence to dig into messy C++-land.

Firstly, on our standard LevelDOWN releases, it turns out that we hadn't actually been enabling the internal bloom filter. The bloom filter was introduced in LevelDB to speed up read operations by avoiding having to scan through whole blocks to find the data a read is looking for. So that's now enabled for 0.10.

Then he discovered that we had been turning off compression by default! I believe this happened with the switch to NAN. The signature for reading boolean options from V8 objects changed: the internal LD_BOOLEAN_OPTION_VALUE and LD_BOOLEAN_OPTION_VALUE_DEFTRUE macros (defaulting to false and true respectively when an option isn't supplied) were replaced by NAN's unified NanBooleanOptionValue, which takes an optional defaultValue argument that can be used to make the default true. This happened at roughly Node version 0.11.4.

Well, this code:

bool compression =
    NanBooleanOptionValue(optionsObj, NanSymbol("compression"));

is now this:

bool compression =
    NanBooleanOptionValue(optionsObj, NanSymbol("compression"), true);

so if you don't supply a "compression" boolean option in your db setup operation then it'll now actually be turned on!

Domains

We've finally caught up with properly supporting Node's domains by switching all C++ callback calls from standard V8 callback->Call(...) to Node's own node::MakeCallback(callback, ...) which does the same thing but also does lots of additional things, including accounting for domains. This change was also included in NAN version 0.5.0.

Summary

Go and upgrade!

leveldown@0.10.0 is packaged with the new levelup@0.18.0 and level@0.18.0 which have their minor versions bumped purely for this LevelDOWN release.

Also released are the packages:

  • leveldown-hyper@0.10.0
  • leveldown-basho@0.10.0
  • rocksdb@0.10.0 (based on the same LevelDOWN code) (Linux only)
  • level-hyper@0.18.0 (levelup on leveldown-hyper)
  • level-basho@0.18.0 (levelup on leveldown-basho)
  • level-rocks@0.18.0 (levelup on rocksdb) (Linux only)

I'll write more about these packages in the future since they've gone largely under the radar for most people. If you're interested in catching up then please join ##leveldb on Freenode where there's a bunch of Node database people and also a few non-Node LevelDB people like Robert Escriva, author of HyperLevelDB and all-round LevelDB expert.

A final note on Node 0.11.9

There will be a LevelDOWN@0.10.1 very soon that will increment the NAN dependency to 0.6.0 when it's released. This new version of NAN will specifically deal with Node 0.11.9 compatibility, where there are more breaking V8 changes that will cause compile errors for any add-on not taking them into account. So if you're living on the edge in Node then we should have a release soon enough for you!

All the levels!

When we completely separated LevelUP and LevelDOWN so that installing LevelUP didn't automatically get you LevelDOWN, we set up a new package called Level that has them both as dependencies, so you just need to do var level = require('level') and everything is done for you.

But we now have more than just the vanilla (Google) LevelDB in LevelDOWN. We also have a HyperLevelDB version and a Basho fork. These are maintained on branches in the LevelDOWN repo and are now usually released every time a new LevelDOWN is released. They are called leveldown-hyper and leveldown-basho in npm, but you need to plug them into LevelUP yourself to make them work. We also have Node LMDB, which is LevelDOWN-compatible, and a few others.

So, as of today, we've released a new, small library called level-packager that does this bundling process so that you can feed it a LevelDOWN instance and it'll return a Level-type object that can be exported from a package like Level. This is meant to be used internally and it's now being used to support these new packages that are available in npm:

  • level-hyper bundles the HyperLevelDB version of LevelDOWN with LevelUP
  • level-basho bundles the Basho fork of LevelDB (via LevelDOWN) with LevelUP
  • level-lmdb bundles Node LMDB with LevelUP

The version numbers of these packages will track the version of LevelUP.

So you can now simply do:

var level = require('level-hyper')
var db = level('/path/to/db')
db.put('foo', 'woohoo!')

If you're already using Level then you can very easily switch it out with one of these alternatives to try them out.

Both HyperLevelDB and the Basho LevelDB fork are binary-compatible with Google's LevelDB, with one small caveat: with the latest release, LevelDB has switched to creating .ldb files instead of .sst files inside a data store directory, because of something about Windows backups (blah blah). Neither of the alternative forks knows anything about these new files yet, so you may run into trouble if you have .ldb files in your store (although I'm pretty sure you can simply rename these to .sst and it'll be fine with any version).
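If you do want to try that rename, it's a one-line loop. Here's a sketch run against stand-in files in a scratch directory (this is my reading of the situation, not an officially documented migration, so back up your store first):

```shell
# Demonstration with stand-in files in a scratch directory; substitute
# your store's directory (after backing it up) to do this for real.
db=$(mktemp -d)
touch "$db/000005.ldb" "$db/000006.ldb"  # stand-ins for real table files
for f in "$db"/*.ldb; do
  mv "$f" "${f%.ldb}.sst"                # .ldb -> .sst
done
ls "$db"
```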

Also, LMDB is completely different to LevelDB so you won't be able to open an existing data store. But you should be able to do something like this:

require('level')('/path/to/level.db').createReadStream()
  .pipe(require('level-lmdb')('/path/to/lmdb.db').createWriteStream())

Whoa...

A note about HyperLevelDB

Lastly, I'd like to encourage you to try the HyperLevelDB version if you are pushing hard on LevelDB's performance. The HyperDex fork is tuned for multi-threaded access for reads and writes and is therefore particularly suited to how we use it in Node. The Basho version doesn't show much performance difference, mainly because they are optimising for Riak running 16 separate instances on the same server, so multi-threaded access isn't as interesting for them. You should find significant performance gains with HyperLevelDB if you're doing very heavy writes in particular. Also, if you're interested in support for HyperLevelDB then pop into ##leveldb on Freenode and bother rescrv (Robert Escriva), author of HyperLevelDB and our resident LevelDB expert.

It's also worth noting that HyperDex are interested in offering commercial support for people using LevelDB, not just HyperLevelDB but also Google's LevelDB. This means that anyone using either of these packages in Node should be able to get solid support if they are doing any heavy work in a commercial environment and need the surety of experts behind them to help pick up the pieces. I imagine this would cover things like LevelDB corruption and any LevelDB bugs you may run into (we're currently looking at a subtle batch-related LevelDB bug that's come along with the 1.14.0 release; they do exist!). Talk to Robert if you want more information about commercial support.

Primitives for JS Databases (an LXJS adventure)

I gave a talk yesterday at LXJS in the "Infrastructure.js" block and tried to talk about JavaScript Database Primitives; i.e. the basic building blocks we have landed on for building more complex database solutions in JavaScript.

The talk certainly wasn't as good or clear as I wanted it to be, it worked much better in my head! A huge venue with over 300 talented JavaScripters, an absolutely massive screen, bright lights and loud amplification got the better of me and I wasn't able to pull the material together how I wanted to. The introvert within me is telling me to become a recluse for a little while just to recover! My hope is that at least one or two people are inspired to give database hacking a go because it's really not that difficult once you get your head around the primitives.

Edit: I wasn't trying to elicit sympathy here, I genuinely think that I wasn't clear on what I was trying to communicate. It went so well in my head, as it usually does, but I fell far short of what I wanted to express. I'll attempt to rectify some of that with a writeup (see next para).

Thankfully though, a portion of the material will be able to serve as the basis for the long-overdue third part in my three-part DailyJS series on LevelDB & Node.

In summary, inspired by LevelDB, we've ended up with a core set of primitives in LevelUP that can be used to build feature-rich and advanced database functionality. Atomic batch and ReadStream are the two non-trivial primitives, open, close, get, put, del are all pretty easy to understand as primitives, although del is perhaps redundant but we're opting for explicitness.

My slides are online but hopefully I'll be able to get my DailyJS article sorted out soon and I'll be able to explain what I was trying to get at.

ReadStream as a primitive query mechanism is not too hard to understand once you get your head around key sorting and the implications for key structure. Batch is a little more subtle and relates to consistency and our ability to augment basic operations to create more complex functionality while keeping the data store in a consistent state.
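To make the sorting point concrete, here's a toy model (a sorted array standing in for the store; this is not the real createReadStream API):

```javascript
// Toy model of ReadStream-style range queries: LevelDB keeps entries in
// key-sorted order, so embedding structure in the key (here "user!date")
// makes "all records for user1" a simple contiguous range scan.
var rows = [
  ['user2!2013-11-01', 'c'],
  ['user1!2013-11-02', 'b'],
  ['user1!2013-11-01', 'a']
].sort() // simulate the store's sorted keyspace

// roughly what createReadStream({ start: 'user1!', end: 'user1!\xff' })
// would hand you, one entry at a time
var hits = rows.filter(function (row) {
  return row[0] >= 'user1!' && row[0] <= 'user1!\xff'
})
console.log(hits.map(function (row) { return row[1] })) // [ 'a', 'b' ]
```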

I additionally raised "Buckets", or "Namespaces", as a primitive concept and discussed how sublevel has effectively become the standard for turning a one-dimensional data store into a multi-dimensional store able to encapsulate sophisticated functionality behind what is essentially just a key/value store.

Thanks to the LXJS team

It would be neglectful of me to not say how absolutely grateful I am to the LXJS team for putting so much effort into taking care of speakers; fantastic job.

LXJS is an amazing event, put on by a dedicated and very talented team of people committed to the JavaScript community and the JavaScript community in Portugal in particular. This conference sets a very high bar for community-driven conferences with the way it has managed to get so many locals (and internationals!) involved in running an event in their own time.

David Dias, Ana Hevesi, Pedro Teixeira, Luís Reis, Nuno Job, Tiago Rodrigues, Leo Xavier, Alexander Kustov, André Rodrigues and Bruno Coelho have managed to put on an amazing event and are some of the nicest and most talented people I've met. Thank you to you all and everyone else who put on LXJS 2013; your hard work is appreciated and should be an inspiration to everyone involved in our local JavaScript communities, running events or considering running events like this.

Should I use a single LevelDB or many to hold my data?

This is a long overdue post, so long in fact that I can't remember who I promised to do this for! Regardless, I keep on having discussions around this topic so I thought it worthwhile putting down some notes on what I believe to be the factors you should consider when making this decision.

What's the question?

It goes like this: you have an application that uses LevelDB. In particular I'm talking about Node.js applications here, but the same would apply if you're using LevelUP in the browser, and also to most of the other back-ends for LevelUP. You invariably end up with different kinds of data; sometimes the kinds of data you're storing are so different that it feels strange putting them into the same storage blob. Often though, you just have sets of not-very-related data that you need to store, and you end up having to make a decision: do I put everything into a single LevelDB store or do I put things into their own, separate, LevelDB stores?

This stuff doesn't belong together!

Coming from a relational database background, it took me a little while to displace the concept of discrete tables with the notion of namespacing within the same store. I can understand the temptation to keep things separate, not wanting to end up with a huge blob of data that just shouldn't be together. But this isn't the relational database world and you need to move on!

We have a set of LevelUP add-ons, such as sublevel, that exist mainly to give you the comfort of separating your data by whatever criteria make sense. bytewise is another tool that can serve a similar purpose, and some people even use sublevel and bytewise together to achieve more complex organisation.

We have the tools at our disposal in Node.js to turn a one-dimensional storage array into a very complex, multidimensional storage system where unrelated and semi-related data can coexist. So, if the only reason you want to store things in separate stores is because it just feels right to do so, you should probably examine what's making you think that way. You may need to update your assumptions.
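To make the idea concrete, here's a toy sketch of what namespacing within a single store amounts to: key prefixing. This is not the sublevel API; the function names and the '\xff' separator choice are my own assumptions for illustration, though prefixing keys with a high separator byte is the underlying technique.

```javascript
// A toy sketch of key-prefix namespacing within one store.
// "Separate" datasets share one LevelDB and one key-space.

function namespaced(prefix) {
  var sep = '\xff' // separator that sorts after typical key bytes

  return {
    // build a namespaced key
    key: function (k) { return prefix + sep + k },

    // a range covering just this namespace, for createReadStream()
    range: function () {
      return { start: prefix + sep, end: prefix + sep + '\xff' }
    }
  }
}

var users = namespaced('users')
var posts = namespaced('posts')

console.log(users.key('alice'))       // 'users' + '\xff' + 'alice'
console.log(posts.key('hello-world')) // 'posts' + '\xff' + 'hello-world'
```

A range query over `users.range()` would then only ever see keys from the users namespace, which is how sublevels appear to behave like independent stores while living in one LevelDB.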

Technical considerations

That aside, there are some technical considerations for making this decision:

Size and performance

To be clear, LevelDB is fast and it can store lots of data; it'll handle gigabytes without too much sweat. However, there are some performance concerns when you start getting into the gigabyte range, mainly when you're trying to push data in at a high rate. Most use-cases don't do this, so be honest about your performance needs. For most people, LevelDB is simply fast.

However, if you do have a high-throughput scenario involving a large amount of data that you need to store then you may want to consider having a separate store to deal with the large data and another one to deal with the rest of your data so the performance isn't impacted across the board.

But again, be honest about what your workload is; you're probably not pushing Voxer amounts of data, so don't prematurely optimise around the workload you'd like to think you have, or are going to have one day in the distant future.

Cache

Caching is transparent by default with LevelDB so it's easy to forget about it when making these kinds of decisions but it's actually quite important for this particular question.

By default, you have an 8 MB LRU cache with LevelDB, and all reads use that cache, both for look-ups and for updating with newly read values. So you can have a lot of cache-thrash unless you're reading the same values again and again.

However, there is a fillCache (boolean) option for read operations (get() as well as createReadStream() and its variations), so you can set it to false where you know you won't need fast access to those entries again and don't want to push other entries out of the LRU.

So caching strategies can be separate for different types of data and are not a strong reason to keep things in a separate data store.

I always recommend tinkering with the cacheSize option when you're using LevelDB; it can be as large as you want, within the available memory of your machine. As a rule of thumb, somewhere between 2/3 and 3/4 of the available memory should be a maximum if you can afford it.

Consider, though, what happens if you use separate LevelDB stores: you now have to juggle cacheSize between them. Often you'll be best served by a single, large cache that operates across all your data types, letting the normal behaviour of your application determine what gets cached, with occasional reliance on 'fillCache': false to fine-tune.
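To see why the fillCache flag matters, here's a toy LRU sketch. This is not LevelDB's actual cache implementation (only the fillCache flag mirrors the real LevelUP option); it just shows how a bulk scan with cache-fill disabled leaves your hot entries in place instead of thrashing them out.

```javascript
// A toy LRU cache sketch demonstrating cache-thrash and a
// fillCache-style bypass. Not LevelDB's real cache.

function LRU(capacity) {
  var map = new Map() // Map preserves insertion order, oldest first

  return {
    get: function (key, fillCache) {
      if (map.has(key)) {          // cache hit: refresh recency
        var hit = map.get(key)
        map.delete(key)
        map.set(key, hit)
        return hit
      }
      var value = 'value-for-' + key // stand-in for a disk read
      if (fillCache !== false) {     // mimic LevelUP's fillCache option
        map.set(key, value)
        if (map.size > capacity)     // evict least recently used
          map.delete(map.keys().next().value)
      }
      return value
    },
    has: function (key) { return map.has(key) }
  }
}

var cache = LRU(2)
cache.get('hot-1')
cache.get('hot-2')

// A bulk scan with fillCache disabled leaves the hot entries cached:
for (var i = 0; i < 100; i++) cache.get('scan-' + i, false)
console.log(cache.has('hot-1'), cache.has('hot-2')) // true true
```

Had the scan used the default (cache-filling) reads, the 100 scanned entries would have pushed 'hot-1' and 'hot-2' straight out of a 2-entry cache; that's the thrash a shared cache avoids when bulk readers opt out.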

Consistency

As I discussed in my LXJS talk, the atomic batch is an important primitive for building solid database functionality with inherent consistency. When you're using sublevel, even though you have what operate like separate LevelUP instances for each sublevel, you still get to perform atomic batch operations between sublevels. Consider indexing, where you may have a primary sublevel for the entries you're writing and a secondary sublevel for the indexing data used to reference the primary data for lookups. If you're running these as separate stores then you lose the benefits of the atomic batch; you just can't perform multiple operations with guaranteed consistency.

Try to keep the atomic batch in mind when building your application: instead of accepting the possibility of inconsistent state, use a batch to maintain consistency.
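Here's a sketch of that indexing pattern. The putUserOps helper and the key layout are hypothetical, and the prefixes stand in for sublevels, but the { type, key, value } objects are the real shape LevelUP's batch() accepts: because both writes travel in one batch, the primary entry and its index entry either both land or neither does.

```javascript
// Build one atomic batch that writes a primary entry and its
// secondary-index entry together. Key prefixes stand in for sublevels.

function putUserOps(id, user) {
  return [
    // primary entry, keyed by id
    { type: 'put', key: 'user\xff' + id, value: JSON.stringify(user) },

    // secondary index: find a user's id by email
    { type: 'put', key: 'index\xffemail\xff' + user.email, value: id }
  ]
}

var ops = putUserOps('42', { name: 'Alice', email: 'alice@example.com' })

// db.batch(ops, callback) would then apply both writes atomically.
console.log(ops.length) // 2
```

Run the same two writes as separate put() calls against two separate stores and a crash between them leaves you with an index pointing at nothing, or an entry you can't look up.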

Back-end flexibility

OK, this one is a bit left-field, but remember that LevelUP is back-end-agnostic. It's inspired by LevelDB but it doesn't have to be Google's LevelDB that's storing data for you. It could be Basho's fork or HyperLevelDB. It could even be LMDB or something a little crazy like MemDOWN or mysqlDOWN!

If you're at all concerned about performance (and most people claim to be, even though they're not building performance-critical applications), then you should be benchmarking your particular workload against your storage system. Each of the back-ends for LevelUP has different performance characteristics and different trade-offs that you need to understand and test against your needs. You may find that one back-end works for one kind of data in your application and another back-end works for another.
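A minimal sketch of such a benchmark, assuming only LevelUP's put(key, value, callback) shape. The in-memory stand-in store here is hypothetical and calls back synchronously just to keep the sketch self-contained; a real run would point at the actual store, back-end, and workload you care about.

```javascript
// Time a batch of puts against any store exposing put(key, value, cb).

function benchmarkPuts(store, count, done) {
  var start = Date.now()
  var pending = count
  for (var i = 0; i < count; i++) {
    store.put('key-' + i, 'value-' + i, function (err) {
      if (err) return done(err)
      if (--pending === 0) done(null, Date.now() - start)
    })
  }
}

// A trivial stand-in store, just to make the sketch runnable.
// Real LevelUP stores call back asynchronously.
var memStore = {
  data: {},
  put: function (key, value, cb) { this.data[key] = value; cb(null) }
}

var elapsed
benchmarkPuts(memStore, 1000, function (err, ms) {
  elapsed = ms
  console.log('1000 puts took ' + ms + 'ms')
})
```

Swap memStore for a LevelUP instance opened over whichever back-end you're evaluating, and use keys and values that look like your real data, not synthetic ones.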

Summary

The TL;DR is: a single LevelDB store is generally preferable unless you have a solid technical reason for separate ones.

Have I missed any considerations that you've come across when making this choice? Let me know in the comments.