<h2><a href="https://r.va.gg/2018/09/the-perils-of-private-politics-in-open-source.html">The perils of private politics in open source</a></h2>
<p><em>Rod Vagg, r.va.gg — 2018-09-05</em></p>
<h3 id="the-node-community-and-its-leadership-is-evolving">The Node community and its leadership are evolving</h3>
<p>Node Summit this year was an interesting and encouraging experience. The stage was full of fresh faces. Fresh faces who weren't there just because they were fresh faces—the project and its surrounding community have genuinely refreshed. A couple of moments in the "hallway track" were instructive as we saw some historical big-names of Node, like Mikeal Rogers, go unrecognised by the busy crowds of active Node users. Node has not only grown up, but it's moved forward as any successful project should, and it has new faces with fresh passion.</p>
<p>Node.js is a complex open source project to manage. There's a huge amount of activity surrounding it and it's become a critical piece of internet infrastructure, so it needs to be stable and trustworthy. Unfortunately this means that it needs corporate structures around it as it interacts with the world outside. One of the roles of the Node.js Foundation is to serve this purpose for the project.</p>
<h3 id="the-inevitability-of-politics">The inevitability of politics</h3>
<p>Politics is an unfortunate fact for open source projects of notable size. Over time, the cruft of corporate and personal politics builds up and creates messy baggage. Like an iceberg, the visible portion of politics represents only a small amount of what's built up over time and what's currently brewing. I had the displeasure of seeing behind the curtain of Node politics as I became more heavily involved, 6 or so years ago. It's not exactly a pretty sight, but I'm certain it's not unique to Node. For the most part, we have a tacit agreement to keep a lot of the messy stuff private. I suppose the theory here is that there's no reason to taint the rosy view of users and contributors who will never be directly impacted by it, and will never even have reason to see it. Speaking about a lot of these things would simply fall into the <em>gossip</em> category, in fact. So we set it aside and move on. But it never really goes away, the pile of cruft just gets bigger.</p>
<p>On many occasions I have watched as idealistic new contributors, technical and non-technical, are raised up to positions where they become exposed to the private politics. It's disheartening to stand by (or worse: be involved in the process), as someone with a rosy view of the Node community is led, nose first, into its smelly armpits. Of course the armpits are only a small part of the body, so having a rosy view isn't illegitimate if you don't intend to get all up into the smelly bits! But leadership requires a good amount of exposure to the smelly parts of Node.</p>
<p>Many people who step into the <em>elite</em> realms of the Technical Steering Committee (TSC) and Community Committee eventually discover that there's a lot behind the curtains. I use the term "elite" sarcastically here, because that's not what these bodies are intended to be. But the fact of having a curtain creates an unfortunate tendency toward separateness.</p>
<p>I'm confident that most, if not all, of the people currently on the TSC and the Community Committee, are in open source because they're attracted to <em>open</em> community. That's what's so great about Node and other thriving open source ecosystems. Many of us are in awkward positions where we have a foot in the corporate world and a foot in open source, so we have to learn to operate in different modes. We end up making choices about how we conduct ourselves regarding the community vs the corporate world, and it's not easy to stay conscious of the different requirements of our various roles.</p>
<p>There's a danger in being drawn up into these kinds of prominent positions in a complex open source project. We get exposed to the armpit, whether we like it or not. Making matters more complex, we are backed by, and therefore have to interact with, a corporation: the Node.js Foundation. The Foundation is run by a board of highly experienced corporate-political operators (and I mean no disrespect by this—it takes a significant amount of skill to navigate to the kinds of positions in large companies that make you an obvious choice to sit on such boards). Furthermore, the Node.js Foundation is essentially a shell that lives inside the Linux Foundation, a <em>very</em> rich source of corporate politics of the open source variety.</p>
<p>So we are faced with competing pressures: our personal passions for open communities, and the complexities of private corporate-style politics that don't fit very well with "openness" and "transparency". I've felt this since the Node.js Foundation began—I sat on the board for two terms at the beginning, participating in those politics. But I've also been one of the loudest voices for transparency and "open governance"—a model that I championed prior to it being adopted by io.js, followed by Node.js under the Foundation. But to be honest, my record is mixed, just like everyone else who has had to navigate similar positions. None of us leave unscathed.</p>
<p>A TSC representative to the board usually wants to be able to report important items to the TSC where they are allowed, and also use the TSC as a sounding-board as they participate in the corporate decision-making process. The dual pressures of "this is sensitive and private" vs "we should be open and transparent" are nasty. Corporate structures and the kinds of politics they engender are not inherently bad, they are arguably necessary in the world in which we exist. In professionalised open source, we are trying to squish together openness and the closed nature of the corporate world which creates a lot of tension and conflicting incentives. It's not hard to understand why many open source projects actively avoid this kind of "professionalisation".</p>
<h3 id="serving-as-counter-balance-to-the-corporate">Serving as counter balance to the corporate</h3>
<p>Here's the critical part that's so easy to lose sight of: <strong>those on the open source side should be the advocates for openness</strong>. That's a large reason we get to be at the big table and it's on us to keep the pressure on, to ensure that a tension continues to exist. Those on the other side are advocates (in some cases legally so) of the corporate approach. The lure of private politics is so strong that it will always have momentum on its side, and it takes very conscious effort to push back against it. From discussions during the formation of the Foundation, I know that there are many on the corporate side that <em>expect</em> open source folks to provide this kind of pressure, that's part of the system's design. Perhaps we should even insert our responsibility as advocates for openness and transparency as an explicit feature of project governance.</p>
<p>The temptation to "keep it private", "take it offline", or "get it right before going public", is natural, because it honestly makes many things easier in the short-term, i.e., it's the path of expedience. Our heated sociopolitical environment today makes this worse; with the potential for drama and mob behaviour leading to an understandable risk aversion.</p>
<p>And so we have private mailing lists, private sections of "public" meetings, one-on-one strategy discussions, huddles in the corners of conferences where we occasionally meet in person. We have to keep the wheels turning and it's easier to just push things through privately than have to deal with the friction of public discussion, feedback, criticism, <em>drama</em>.</p>
<p>But that's neglecting what our communities charge us with! Particularly for Node.js, where we have a governance model that makes it clear that the TSC doesn't own or run the project. <strong>The TSC is intended to simply be a decision-making fallback</strong>.</p>
<p>The project is intended to be owned and run by its contributors. The TSC should be the facilitators of that, and reluctantly involve itself collectively where individual contributors can't find a way forward. The Community Committee is intended to take an outward-facing role, making for a different kind of challenging bargain when they get sucked in. Today we have various stakeholders asking for "the TSC's opinion", or "the Community Committee's opinion" on matters. I even tried to get that when I was a board member, attempting to "represent the TSC," which I believed was my role at the time. But there really should be no such thing as the TSC or Community Committee's "opinion", it's frankly absurd when you consider that these bodies are supposed to be made up of a broad diversity of viewpoints.</p>
<p>The more we accept private decision-making and politicking, the more we undermine community-focused governance.</p>
<h3 id="a-busy-time-for-private-politicking">A busy time for private politicking</h3>
<p>It could be an artefact of my perspective, but it seems that we have a nexus of some very heavy private political discussions happening in Node.js-land at the moment. My fear is that it's the sign of a trend, and I hope this post can help serve as a corrective.</p>
<p>Some highlights that you probably won't see the context of on GitHub include:</p>
<ol>
<li>Discussions about the Node.js Foundation's annual conference, Node.js Interactive, which was renamed this year to JS Interactive, and then renamed again recently to Node+JS Interactive. The Foundation's executive decided to involve the TSC and Community Committee in that last decision and it was resolved entirely in private (as far as I'm aware) and then announced to the world. I personally thought the switch to "JS Interactive" was a mistake. I also thought that changing the name <em>again</em> was a mistake. To be honest (and in hindsight), however, I'd rather the executive didn't even draw the TSC and Community Committee in, as collectives, to these private discussions, because we've now become complicit in the private decision making process. It's really not a good look for either committee to be involved in surprise major announcements—that stands in stark contrast to our open decision making processes. Seek individual feedback, sure, but this also goes back to my point about the problems with seeking collective opinions.</li>
<li>Discussions about the efficacy of the Foundation as an executive body, particularly as it focuses on filling an empty Executive Director (i.e. CEO) chair. I'll admit guilt to fuelling a lot of this discussion myself, I've been a strong critic of the Foundation in recent times. However, those of us with something to say either need to be bold enough to be public with critique, take it directly to decision-makers involved, or butt out entirely. As an advocate of openness and transparency, I'd suggest that public discussion on these matters would be fruitful because so many people and organisations are impacted by them.</li>
<li>Discussions about very major restructuring of the Node.js Foundation itself. Having implications that would call for large changes to the by-laws, including changing the very purpose of the Foundation. I don't want to be the one to speak publicly about this, but I would like to see those who are driving this discussion be able to make their case in public sooner rather than later. This will again lead to surprise major announcements that the TSC and Community Committee will again be complicit in. The board should either make such changes and own the responsibility for it, or should set up an open process for feedback and discussion. The TSC and Community Committee should be rejecting the requests made of them to <em>be</em> the source of feedback and discussion prior to major changes. These bodies can facilitate broader discussions, but they are not the source of definitive opinion or truth for the Node.js community.</li>
<li>Discussions about major changes to Node.js project governance, instigated by parties external to the TSC and Community Committee, entirely in private and with significant political and ideological pressure. Large discussion threads and entire meetings have been devoted to these matters already, without one hint to the outside world that a flurry of pull requests may soon appear with little context. Given some of the ideological content there is potential for more <em>drama</em> so I have a lot of sympathy for people for wanting to take the easy path with this. I'm close enough to the centre of these matters that I will likely write publicly about them soon. I have very strong opinions about what's good for <em>open governance,</em> and maintaining a diversity of opinion and viewpoints on the critical bodies surrounding Node.js. My primary objection to the process conducted so far is that any discussions about change in open source governance by governing bodies <em>must</em> be conducted in the open. And that any private collective change-planning by these bodies undermines the community-focused open governance model that Node.js has adopted.</li>
</ol>
<h3 id="finding-a-balancing-point">Finding a balancing point</h3>
<p>I'd like to <em>not</em> give private politics any more legitimacy in open source. Leave it for the corporate realm. I'd like for the TSC and Community Committee to adopt an explicit position of being <em>against</em> closed-door discussions wherever possible and being advocates for openness and transparency. That will cause personal conflicts, it will mean difficulty for the Node.js Foundation board and the executive when they want to engage in "sensitive" topics. But the definition of "sensitive" on our side should be different. What's more, we are not there to solve their problems, it's precisely the opposite!</p>
<div style="text-align: center;"><img src="/2018/09/politics-venn.png" alt="The ideal dynamic between open source and its corporate partners" title="The ideal dynamic between open source and its corporate partners" width="800" height="505" /></div>
<p>Here are my initial suggestions for guidelines:</p>
<ol>
<li>If you want to call something "sensitive" then it better involve something personal about an individual who could be hurt by public discussion of the matter.</li>
<li>If you have a proposal for changing anything and you're not prepared to receive public feedback and critique, then it's not worth us discussing as groups.</li>
<li>The TSC and Community Committee should refuse, as much as possible, to be involved as collectives in decision-making processes that must be private for corporate governance or legal reasons. That's not their role and it's unfair to force them into an awkward corner.</li>
</ol>
<p>The wriggle-room between treating the TSC and Community Committee as "groups" vs pulling individuals from those groups into private politics as a proxy is a tricky one that every individual is going to have to negotiate. I would hope that having these groups explicitly state their preference for openness, while highlighting the risks, would go a long way to creating the healthy tension that we need with our corporate partners.</p>
<p><em>I hope it’s obvious, but this is all my personal opinion and does not necessarily represent that of my employer nor any bodies surrounding Node.js that I’m involved in.</em></p>
<h2><a href="https://r.va.gg/2018/08/node.js-and-the-hashwick-vulnerability.html">Node.js and the "HashWick" vulnerability</a></h2>
<p><em>2018-08-30</em></p>
<p>The following post was originally published on the <a href="https://nodesource.com/blog/node-js-and-the-hashwick-vulnerability">NodeSource Blog</a>. This text is copyright NodeSource and is reproduced with permission.</p>
<hr>
<p>Yesterday, veteran Node.js core contributor and former Node.js TSC member Fedor Indutny published an article on his personal blog detailing a newly-discovered vulnerability in V8. Named <a href="https://darksi.de/12.hashwick-v8-vulnerability/">HashWick</a>, this vulnerability will need to be addressed by Node.js, but as yet has not been patched.</p>
<p>This article will cover the details surrounding the disclosure yesterday, and explain some of the technical background. As a patch for Node.js is not yet available, I will also present some mitigation options for users and discuss how this vulnerability is likely to be addressed by Node.js.</p>
<h2 id="responsible-disclosure">Responsible disclosure</h2>
<p>Fedor originally reported this vulnerability to V8 and the Node.js security team in May. Unfortunately, the underlying issues are complex, and Node's use of older V8 engines complicates the process of finding and applying a suitable fix. The Node.js TSC delegated responsibility to the V8 team to come up with a solution.</p>
<p>After reporting the vulnerability, Fedor followed a standard practice of holding off public disclosure for 90 days, and although a fix has yet to land in Node, he published high-level details of his findings. </p>
<p>It is worth pointing out that Fedor’s disclosure does not contain code or specific details on how to exploit this vulnerability; moreover, to exploit HashWick a malicious party would need to tackle some fairly difficult timing analysis. However, knowledge that such a vulnerability exists, and can potentially be executed on a standard PC, is likely to spur some to reverse engineer the details for themselves.</p>
<p>These circumstances leave us all in an awkward situation while we wait for a fix, but I expect this disclosure to result in security releases in Node.js in the coming weeks.</p>
<h2 id="vulnerability-details">Vulnerability details</h2>
<p>There are three important concepts involved in this vulnerability:</p>
<ol>
<li>Hash functions and hash tables</li>
<li>Hash flooding attacks</li>
<li>Timing analysis</li>
</ol>
<h3 id="hash-functions">Hash functions</h3>
<p>Hash functions are a fundamental concept in computer science. They are typically associated with cryptography, but are widely used for non-cryptographic needs. A <a href="https://en.wikipedia.org/wiki/Hash_function">hash function</a> is simply any function that takes input data of some type and is able to repeatedly return output of a predictable size and range of values. An ideal hash function is one that exhibits apparent randomness and whose results spread evenly across the output range, regardless of input values.</p>
<p>To understand the utility of such functions, consider a "sharded" database system, divided into multiple storage backends. To route data storage and retrieval, you need a routing mechanism that knows which backend that data belongs in. Given a key, how should the routing mechanism determine where to <em>put</em> new data, and then where to <em>get</em> stored data when requested? A random routing mechanism isn't helpful here, unless you also want to store metadata telling you which random backend a particular key's value was placed in.</p>
<p>This is where hash functions come in handy. A hash function would allow you to take any given key and return a “backend identifier” value, directing the routing mechanism to assign data to a particular backend. Despite apparent randomness, a good hash function can thus distribute keys across all of your backends fairly evenly.</p>
<p>This concept also operates at the most basic levels of our programming languages and their runtimes. Most languages have hash tables of some kind; data structures that can store values with arbitrary keys. In JavaScript, almost any object can become a hash table because you can add string properties, and store whatever values you like. This is because <code>Object</code> is a form of hash table, and almost everything is related to <code>Object</code> in some way. <code>const foo = { hash: 'table' }</code> stores the value <code>'table'</code> at key <code>'hash'</code>. Even an <code>Array</code> can take the form of a hash table. Arrays in JavaScript are not limited to integer keys, and they can be as sparse as you like: <code>const a = [ 1, 2, 3 ]; a[1000] = 4; a['hash'] = 'table';</code>. The underlying storage of these hash tables in JavaScript needs to be practical and efficient.</p>
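<p>Those inline examples are runnable as-is:</p>

```javascript
// An Object used as a hash table: arbitrary string keys, arbitrary values.
const foo = { hash: 'table' };

// An Array can behave the same way: sparse integer keys and string keys.
const a = [1, 2, 3];
a[1000] = 4;         // the array is now sparse, with length 1001
a['hash'] = 'table'; // string keys don't affect the length at all
```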
<p>If a JavaScript object is backed by a memory location of a fixed size, the runtime needs to know where in that space a particular key's value should be located. This is where hash functions come in. An operation such as <code>a['hash']</code> involves taking the string <code>'hash'</code>, running it through a hash function, and determining exactly where in the object's memory storage the value belongs. But here's the catch: since we are typically dealing with small memory spaces (a new <code>Array</code> in V8 starts off with space for only 4 values by default), a hash function is likely to produce "collisions", where the output for <code>'hash'</code> may collide with the same location as <code>'foo'</code>. So the runtime has to take this into account. V8 deals with collision problems by simply incrementing the storage location by one until an empty space can be found. So if the storage location for <code>'hash'</code> is already occupied by the value of <code>'foo'</code>, V8 will move across one space, and store it there if that space is empty. If a new value has a collision with either of these spaces, then the incrementing continues until an empty space is found. This process of incrementing can become costly, adding time to data storage operations, which is why hash functions are so important: a good hash function will exhibit maximum randomness.</p>
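<p>The check-and-increment behaviour can be illustrated with a toy table built on a deliberately weak hash. This is a simplified sketch only; V8's real property storage is far more sophisticated:</p>

```javascript
// A toy fixed-size store with linear probing, illustrating the
// check-and-increment collision handling described above.
class ProbingTable {
  constructor(size) {
    this.keys = new Array(size).fill(null);
    this.values = new Array(size).fill(null);
  }
  // Deliberately weak hash: sum of character codes modulo table size,
  // so any permutation of a key ('hash' vs 'hsah') collides.
  hash(key) {
    let h = 0;
    for (const ch of key) h += ch.charCodeAt(0);
    return h % this.keys.length;
  }
  set(key, value) {
    let i = this.hash(key);
    // Increment (with wrap-around) until we find our key or a free slot.
    while (this.keys[i] !== null && this.keys[i] !== key) {
      i = (i + 1) % this.keys.length;
    }
    this.keys[i] = key;
    this.values[i] = value;
  }
  get(key) {
    let i = this.hash(key);
    while (this.keys[i] !== null) {
      if (this.keys[i] === key) return this.values[i];
      i = (i + 1) % this.keys.length;
    }
    return undefined;
  }
}
```

<p>Every collision adds one more probe step to both storage and lookup—the extra work that a flood of colliding keys multiplies.</p>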
<h3 id="hash-flooding-attacks">Hash flooding attacks</h3>
<p>Hash flooding attacks take advantage of predictability, or poor randomness, in hash functions to overwhelm a target and force it to work hard to store or look up values. These attacks essentially bypass the utility of a hash function by forcing excessive work to find storage locations.</p>
<p>In our sharded data store example above, a hash flood attack may involve an attacker knowing exactly how keys are resolved to storage locations. By forcing the storage or look-up of values in a single backend, an attacker may be able to overwhelm the entire storage system by placing excessive load on that backend, thereby bypassing any load-sharing advantage that a bucketing system normally provides.</p>
<p>In Node.js, if an attacker knows exactly how keys are converted to storage locations, they may be able to send a server many object property keys that resolve to the same location, potentially causing an increasing amount of work as V8 performs its check-and-increment operations finding places to store the values. Feed enough of this colliding data to a server and it'll end up spending most of its time simply trying to figure out how to store and address it. This could be as simple as feeding a JSON string to a server that is known to parse input JSON. If that JSON contains an object with many keys that all collide, the object construction process will be very expensive. This is the essence of a denial-of-service (DoS) attack: force the server to do an excessive amount of work, preventing it from being able to perform its normal functions.</p>
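<p>To see the shape of such an attack without targeting any real system, consider a deliberately weak, unkeyed hash: because it is fully predictable, colliding keys can be generated mechanically. The functions below are purely illustrative.</p>

```javascript
// Illustration only: a weak hash where any permutation of a string
// collides, because it just sums character codes.
function weakHash(key) {
  let h = 0;
  for (const ch of key) h += ch.charCodeAt(0);
  return h;
}

// Generate `count` distinct permutations of `base`, all of which hash
// to the same value under weakHash.
function collidingKeys(base, count) {
  const keys = new Set([base]);
  const chars = base.split('');
  while (keys.size < count) {
    // Fisher–Yates shuffle to produce another permutation.
    for (let i = chars.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [chars[i], chars[j]] = [chars[j], chars[i]];
    }
    keys.add(chars.join(''));
  }
  return [...keys];
}
```

<p>A JSON body built from thousands of such keys would concentrate all of its property stores on one location of a table using this hash. V8's defence is that its real hash is keyed with a random seed, which is exactly what this vulnerability undermines.</p>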
<p>Hash flooding is a well known attack type, and standard mitigation involves very good hash functions, combined with additional randomness: <strong><em>keyed hash functions</em></strong>. A keyed hash function is a hash function that is seeded with a random key. That same seed is provided with every hash operation, so that together, the seed and an input value yield the same output value. Change the seed, and the output value is entirely different. In this way, it is not good enough to simply know the particular hash function being used, you also need to know the random seed the system is using.</p>
<p>V8 uses a keyed hash function for its object property storage operations (and other operations that require hash functions). It generates a random key at start-up and keeps on using that key for the duration of the application's lifetime. To execute a hash flood type attack against V8, you need to know the random seed it's using internally. This is precisely what Fedor has figured out how to do—determine the hash seed used by an instance of V8 by inspecting it from the outside. Once you have the seed, you can perform a hash flood attack and render a Node.js server unresponsive, or even crash it entirely.</p>
<h3 id="timing-attacks">Timing attacks</h3>
<p>We covered timing attacks in some detail in our <a href="https://nodesource.com/blog/node-js-security-release-summary-august-2018">deep dive of the August 2018 Node.js security releases</a>. A timing attack is a method of determining sensitive data or program execution steps, by analyzing the time it takes for operations to be performed. This can be done at a very low level, such as most of the recent high-profile vulnerabilities reported against CPUs that rely on memory look-up timing and the timing of other CPU operations.</p>
<p>At the application level, a timing attack could simply analyze the amount of time it takes to compare strings and make strong guesses about what's being compared. In a sensitive operation such as <code>if (inputValue == 'secretPassword') ...</code>, an attacker may feed many string variations and analyze the timing. The time it takes to process <code>inputValue</code>s of <code>'a'</code>, <code>'b'</code> ... <code>'s'</code> may give enough information to assume the first character of the secret. Since timing differences are so tiny, it may take many passes and an averaging of results to make strong enough inferences. Timing attacks often involve a <em>lot</em> of testing, and a timing attack against a remote server will usually involve sending a <em>lot</em> of data.</p>
<p>Fedor's attack against V8 involves using timing differences to work out the hash seed in use. He claims that by sending approximately 2G of data to a Node.js server, he can collect enough information to reverse engineer the seed value. Thanks to quirks in JavaScript and in the way V8 handles object construction, an external attacker can force many increment-and-store operations. By collecting enough timing data on these operations, combined with knowledge of the hash algorithm being used (which is no secret), a sophisticated analysis can unearth the seed value. Once you have the seed, a hash flood attack is fairly straightforward.</p>
<h2 id="mitigation">Mitigation</h2>
<p>There are a number of ways a Node.js developer can foil this type of attack without V8 being patched, or at least make it more difficult. These also represent good practice in application architecture so they are worth implementing regardless of the impact of this specific vulnerability.</p>
<p>The front line for mitigating timing attacks against publicly accessible network services is <strong>rate limiting</strong>. Note that Fedor needs to send 2G of data to determine the hash seed. A server that implements basic rate limiting for clients is likely to make such an attack more difficult or impractical to execute. Unfortunately, such rate limiting needs to be applied <em>before</em> too much internal V8 processing is allowed to happen. A <code>JSON.parse()</code> on an input string <em>before</em> telling the client that they have exceeded the maximum requests for their IP address won't help mitigate. Additionally, rate limiting may not mitigate against distributed timing attacks, although these are much more difficult to execute due to the variability in network conditions across multiple clients, leading to very fuzzy timing data.</p>
<p>Other types of <strong>input limiting</strong> will also be useful. If your service blindly applies a <code>JSON.parse()</code>, or other operation, to any length of input, it will be much easier for an attacker to unearth important timing information. Ensure that you have basic input limit checks in place and your network services don't blindly process whatever they are provided.</p>
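<p>A minimal sketch of such a limit in Node.js, capping the body size before anything is parsed (the helper name and error handling are this example's own, not a standard API):</p>

```javascript
// Read a request body up to `limit` bytes; abort early if it's exceeded,
// before any expensive parsing work is done on the input.
function readLimitedBody(req, limit, callback) {
  const chunks = [];
  let received = 0;
  req.on('data', (chunk) => {
    received += chunk.length;
    if (received > limit) {
      req.destroy(); // stop reading; don't buffer or parse anything more
      callback(new Error('payload too large'));
      callback = () => {}; // make sure the callback can't fire again
      return;
    }
    chunks.push(chunk);
  });
  req.on('end', () => callback(null, Buffer.concat(chunks).toString()));
}
```

<p>Inside an <code>http</code> request handler, you would only call <code>JSON.parse()</code> on the body once this callback succeeds.</p>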
<p>Standard <strong>load balancing</strong> approaches make such attacks more difficult too. If a client cannot control which Node.js instance it is talking to for any given connection, it will be much more difficult to perform a useful timing analysis of the type Fedor has outlined. Likewise, if a client has no way to determine which unique instance it has been talking to (such as a cookie that identifies the server instance), such an attack may be impossible given a large enough cluster.</p>
<h3 id="the-future-for-v8">The future for V8</h3>
<p>As Fedor outlined in his post, the best mitigation comes from V8 fixing its weak hash function. The two suggestions he has are:</p>
<ol>
<li>Increase the hash seed size from 32 bits to 64 bits</li>
<li>Replace the hash function with something that exhibits better randomness</li>
</ol>
<p>The key size suggestion simply increases the complexity and cost of an attack, but doesn't make it go away. Any sufficiently motivated attacker with enough resources may be able to perform the same attack, just on a different scale. Instead of 2G of data, an attacker may need to send far more, which may be impractical in many cases.</p>
<p>A change of hash function would follow a practice adopted by many runtimes and platforms that require hash functions but need to protect against hash flood attacks. <a href="https://en.wikipedia.org/wiki/SipHash">SipHash</a> was developed specifically for this use and has been slowly adopted as a standard since its introduction 6 years ago. Perl, Python, Rust and Haskell all use SipHash in some form for their hash table data structures.</p>
<p>SipHash has properties similar to constant-time operations used to mitigate against other forms of timing attacks. By analyzing the timing of the hash function, you cannot (as far as we know) make inference about the seed being used. SipHash is also fast in comparison to many other common and secure keyed hash functions, although it may not be faster than the more naive operation V8 is currently using. Ultimately, it’s up to the V8 authors to come up with an appropriate solution that takes into account the requirement for security and the importance of speed.</p>
<h2><a href="https://r.va.gg/2018/08/background-briefing-august-node.js-security-releases.html">Background Briefing: August Node.js Security Releases</a></h2>
<p><em>2018-08-29</em></p>
<p>The following post was originally published on the NodeSource Blog as <a href="https://nodesource.com/blog/node-js-security-release-summary-august-2018">Node.js Security Release Summary - August 2018</a>. This text is copyright NodeSource and is reproduced with permission. This is a deep-dive into the security vulnerabilities described in my brief summary on the Node.js blog as <a href="https://nodejs.org/en/blog/vulnerability/august-2018-security-releases/">August 2018 Security Releases</a>.</p>
<hr>
<p>This month's Node.js security releases are primarily focused on upgrades to the OpenSSL library. There are also two minor security-related flaws in Node.js' <code>Buffer</code> object. All of the flaws addressed in the OpenSSL upgrade and the fixes to <code>Buffer</code> can be classified as either "low" or "very low" in severity. However, this assessment is generic and may not be appropriate for your own Node.js application. It is important to understand the basics of the flaws being addressed and make your own impact assessment. Most users will not be impacted at all by the vulnerabilities being patched, but specific use-cases may cause a high severity impact. You may also be exposed via packages you use from npm, so upgrading as soon as practical is always recommended.</p>
<p>Node.js switched to the new 1.1.0 release line of OpenSSL for version 10 earlier this year. Before Node.js 10 becomes LTS in October, we expect to further upgrade it to OpenSSL 1.1.1 which will add TLS 1.3 support. Node.js' current LTS lines, 8 ("Carbon") and 6 ("Boron") will continue to use OpenSSL 1.0.2.</p>
<p>In the meantime, OpenSSL continues to support their 1.1.0 and 1.0.2 release lines with a regular stream of security fixes and improvements and Node.js has adopted a practice of shipping new releases with these changes included shortly after their release upstream. Where there are non-trivial "security" fixes, Node.js will generally ship LTS releases with only those security fixes so users have the ability to drop in low-risk upgrades to their deployments. This is the case for this month's releases.</p>
<p>The August OpenSSL releases of versions 1.1.0i and 1.0.2p are technically labelled "bug-fix" releases <a href="https://mta.openssl.org/pipermail/openssl-announce/2018-August/000129.html">by the OpenSSL team</a> but they do include security fixes! The reason this isn't classified as a security release is that those security fixes have already been disclosed and the code is available on GitHub. They are low severity, and one of the three security items included doesn't even have a CVE number assigned to it. However, this doesn't mean they should be ignored. You should be aware of the risks and possible attack vectors before making decisions about rolling out upgrades.</p>
<h2 id="openssl-client-dos-due-to-large-dh-parameter-cve-2018-0732-https-www-openssl-org-news-secadv-20180612-txt-">OpenSSL: Client DoS due to large DH parameter (<a href="https://www.openssl.org/news/secadv/20180612.txt">CVE-2018-0732</a>)</h2>
<p>All actively supported release lines of Node.js are impacted by this flaw. Patches are included in both OpenSSL 1.1.0i (Node.js 10) and 1.0.2p (Node.js 6 LTS "Boron" and Node.js 8 LTS "Carbon").</p>
<p>This fixes a potential denial of service (DoS) attack against <em>client</em> connections by a malicious server. During a TLS communication handshake, where both client and server agree to use a cipher-suite using DH or DHE (Diffie–Hellman, in both ephemeral and non-ephemeral modes), a malicious server can send a very large prime value to the client. Because this has been unbounded in OpenSSL, the client can be forced to spend an unreasonably long period of time to generate a key, potentially causing a denial of service.</p>
<p>We would expect to see a higher severity for this bug if it were reversed and a client could impose this tax on servers. But in practice, there are more limited scenarios where a denial of service is practical against client connections.</p>
<p>The <a href="https://github.com/openssl/openssl/commit/ea7abeeab">fix</a> for this bug in OpenSSL limits the prime modulus to 10,000 bits. Primes in excess of this limit will simply fail the DH handshake and a standard SSL error will be emitted.</p>
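<p>To illustrate the kind of bound OpenSSL now enforces, here is a sketch of rejecting an oversized peer-supplied prime before any expensive arithmetic is attempted. The function names are hypothetical, for illustration only; the real check happens inside OpenSSL's DH code, not in JavaScript:</p>

```javascript
// Illustrative only: the kind of upper bound OpenSSL now enforces
// (10,000 bits) on a peer-supplied DH prime modulus.
const MAX_DH_PRIME_BITS = 10000;

// Bit length of a non-negative BigInt.
function bitLength(n) {
  return n === 0n ? 0 : n.toString(2).length;
}

// Hypothetical guard: reject oversized primes before spending CPU on
// modular exponentiation with them.
function checkDhPrime(prime) {
  if (bitLength(prime) > MAX_DH_PRIME_BITS) {
    throw new Error('DH parameter too large');
  }
  return true;
}
```

The point of the bound is that key generation cost grows with the size of the prime, so capping the size caps the work a malicious server can force a client to do.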
<p>Scenarios where Node.js users may need to be concerned about this flaw include those where your application is making client TLS connections to untrusted servers, where significant CPU costs in attempting to establish that connection are likely to cause cascading impact in your application. A TLS connection could be for HTTPS, encrypted HTTP/2 or a plain TLS socket. An "untrusted server" is one outside of your control and not in the control of trustworthy third-parties. An application would likely need to be forced to make a large number of these high-cost connections for an impact to be felt, but you should assess your architecture to determine if such an impact is likely, or even possible.</p>
<h2 id="openssl-cache-timing-vulnerability-in-rsa-key-generation-cve-2018-0737-https-www-openssl-org-news-secadv-20180416-txt-">OpenSSL: Cache timing vulnerability in RSA key generation (<a href="https://www.openssl.org/news/secadv/20180416.txt">CVE-2018-0737</a>)</h2>
<p>Node.js is not impacted by this vulnerability as it doesn't expose or use RSA key generation functionality in OpenSSL. However, it is worth understanding some of the background of this vulnerability as we are seeing an increasing number of software and hardware flaws relating to potential timing attacks. Programming defensively so as to not expose the timing of critical operations in your application is just as important as sanitizing user input while constructing SQL queries. Unfortunately, timing attacks are not as easy to understand, or as obvious, so tend to be overlooked.</p>
<p>Side-channel attacks are far from new, but there is more interest in this area of security, and researchers have been focusing more attention on novel ways to extract hidden information. <a href="https://spectreattack.com/">Spectre and Meltdown</a> are the two recent high-profile examples that target CPU design flaws. CVE-2018-0737 is another example, and itself uses hardware-level design flaws. A <a href="https://eprint.iacr.org/2018/367.pdf">paper</a> by Alejandro Cabrera Aldaya, Cesar Pereida García, Luis Manuel Alvarez Tapia and Billy Bob Brumley from Universidad Tecnológica de la Habana (CUJAE), Cuba, and Tampere University of Technology, Finland outlines a cache-timing attack on RSA key generation, the basis of this OpenSSL flaw.</p>
<p>The CVE-2018-0737 flaw relies on a "<a href="https://www.usenix.org/node/184416">Flush+Reload attack</a>" which targets the last-level of cache on the system (L3, or level-3 cache on many modern processors). This type of attack exploits the way that Intel x86 architectures structure their cache and share it between processors and processes for efficiency. By setting up a local process that shares an area of cache memory with another process you wish to attack, you can make high-confidence inferences about the code being executed in that process. The attack is called "Flush+Reload" because the process executing the attack, called the "spy", causes a flush on the area of cache containing a piece of critical code, then waits a small amount of time and reloads that code in the cache. By measuring the amount of time the reload takes, the spy can infer whether the process under attack loaded, and therefore executed, the code in question or not. This attack looks at code being executed, not data, but in many cryptographic calculations, the sequence of operations can tell you all you need to know about what data is being generated or operated on. These attacks have been successfully demonstrated against different implementations of RSA, ECDSA and even AES. The attack has been shown to work across virtual machines in shared environments under certain circumstances. One researcher even demonstrated the ability to detect the sequence of operations executed by a user of <code>vi</code> on a shared machine.</p>
<p>An important take-away about cache-timing attacks is that they require local access to the system under attack. They are an attack type that probes the physical hardware in some way to gather information. Public clouds are usually not vulnerable because of the way cache is configured and partitioned, but we shouldn't assume we won't see new novel timing attacks that impact public clouds in the future. Of course browsers blur the definition of "local code execution", so we shouldn't be complacent! CVE-2018-0737 is marked as "Low" severity by the OpenSSL team because of the requirement for local access, the difficulty in mounting a successful attack and the rare circumstances in which an attack is feasible.</p>
<p>The best protection against Flush+Reload and many other classes of timing attacks is to use <strong>constant-time operations</strong> for cryptographic primitives and operations that expose potentially sensitive information. If an operation follows a stable code path and takes a constant amount of time regardless of input or output then it can be hard, or impossible to make external inference about what is going on. An operation as simple as <code>if (userInput === 'supersecretkey') { ... }</code> can be vulnerable to a timing attack if an attacker has the ability to execute this code path enough times. In 2014, as the PHP community debated switching certain operations to constant-time variants, <a href="https://blog.ircmaxell.com/2014/11/its-all-about-time.html">Anthony Ferrara</a> wrote a great piece about timing attacks and the types of mitigations available. Even though it addresses PHP specifically, the same concepts are universal.</p>
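<p>A toy illustration of why the naive comparison leaks: the loop below counts character comparisons instead of measuring wall-clock time, but the principle is the same. The more of the guess that matches, the more work is done before the early exit, and an attacker who can time many attempts can recover a secret one character at a time:</p>

```javascript
// Toy illustration: an early-exit string comparison does an amount of
// work proportional to the length of the matching prefix. The `steps`
// counter stands in for the timing signal an attacker would measure.
function naiveEqual(guess, secret) {
  naiveEqual.steps = 0;
  if (guess.length !== secret.length) return false;
  for (let i = 0; i < secret.length; i++) {
    naiveEqual.steps++;
    if (guess[i] !== secret[i]) return false; // early exit leaks position
  }
  return true;
}
```

A wrong first character exits after one comparison; a guess that is correct up to the last character does nearly a full pass, and that difference is observable from outside.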
<p>The fix that OpenSSL applied for CVE-2018-0737 was a straight-forward switch to constant-time operations for the code in question. For RSA, this has the effect of masking the operations being performed from side-channel inspection, such as the use of cache.</p>
<p>Be aware that Node.js has a <a href="https://nodejs.org/docs/latest-carbon/api/crypto.html#crypto_crypto_timingsafeequal_a_b"><code>crypto.timingSafeEqual()</code></a> operation that can be used whenever performing sensitive comparisons. Using this function, our vulnerable operation becomes <code>if (crypto.timingSafeEqual(Buffer.from(userInput), Buffer.from('supersecretkey'))) { ... }</code> and we stop exposing timing information to potential attackers. Note that <code>crypto.timingSafeEqual()</code> throws if the two <code>Buffer</code>s have different lengths, so inputs must be normalized to the same length first.</p>
<h2 id="openssl-ecdsa-key-extraction-local-side-channel">OpenSSL: ECDSA key extraction local side-channel</h2>
<p>All actively supported release lines of Node.js are impacted by this flaw. Patches are included in both OpenSSL 1.1.0i (Node.js 10) and 1.0.2p (Node.js 6 LTS "Boron" and Node.js 8 LTS "Carbon").</p>
<p>This flaw does not have a CVE due to OpenSSL policy to not assign itself CVEs for local-only vulnerabilities that are more academic than practical. This vulnerability was discovered by <a href="https://www.nccgroup.trust/us/our-research/technical-advisory-return-of-the-hidden-number-problem/">Keegan Ryan at NCC Group</a> and impacts many cryptographic libraries including LibreSSL, BoringSSL, NSS, WolfCrypt, Botan, libgcrypt, MatrixSSL, and of course OpenSSL. A CVE was assigned for this issue specifically for libgcrypt, CVE-2018-0495.</p>
<p>This flaw is very similar to the RSA key generation cache-timing flaw above in that it also relies on cache timing and an attacker must be able to execute code on the local machine being attacked. It also uses Flush+Reload to infer the operations being performed, but this time it examines the Digital Signature Algorithm (DSA) and the Elliptic Curve Digital Signature Algorithm (ECDSA), and a little more information is required to mount a successful attack. In an attack scenario, the victim uses a private key to create several signatures. The attacker observes the resulting signatures and must know the messages being signed. Then, the cache-timing side-channel is used to infer the order of operations and work backward to find the private key.</p>
<p>This attack could be used against TLS, or SSH, and there are mechanisms in both that would give an attacker enough information to perform a successful attack under certain circumstances. The key requirement, again, is local access to a server performing the DSA or ECDSA signing operation, or access to a virtual machine on the same host, as long as the cache isn't partitioned as it often is for public clouds.</p>
<p>Unlike the RSA flaw, a fix is not as simple as switching to constant-time operations. Instead, the <a href="https://github.com/openssl/openssl/pull/6523">fix</a> involves adding <a href="https://en.wikipedia.org/wiki/Blinding_(cryptography)">“blinding”</a> to the calculation. Blinding is a technique that can mask the underlying operation from side-channel inspection by inserting unpredictability which can be later reversed. This specific fix addresses the problematic addition (<code>+</code>) operation which exposes the side-channel leak. It does this by adding a random value as noise to both sides of the equation. Now, when observing the operation, it is theoretically impossible to remove the noise and discover the important information that would leak data.</p>
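<p>A conceptual toy, not real cryptography, to show the shape of the blinding idea: mix a random value into the secret before operating on it, then remove it afterwards. An observer of the intermediate values sees only randomized data, yet the final answer is unchanged:</p>

```javascript
// Conceptual toy only. Real blinding operates on modular arithmetic
// inside the signature algorithm, not plain BigInt addition.
function blindedAdd(secret, addend) {
  const r = BigInt(Math.floor(Math.random() * 1e9)); // random blinding value
  const blinded = secret + r;       // observable intermediate is randomized
  const result = blinded + addend;  // do the work on blinded data
  return result - r;                // unblind: same answer as secret + addend
}
```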
<h2 id="unintentional-exposure-of-uninitialized-memory-in-buffer-creation-cve-2018-7166-">Unintentional exposure of uninitialized memory in <code>Buffer</code> creation (CVE-2018-7166)</h2>
<p>All versions of Node.js 10 are impacted by this flaw. Prior release lines are not impacted.</p>
<p>Node.js TSC member Сковорода Никита Андреевич (Nikita Skovoroda / <a href="https://github.com/chalker">@ChALkeR</a>) discovered an argument processing flaw that causes <code>Buffer.alloc()</code> to return uninitialized memory. This method is intended to be safe and only return initialized, or cleared, memory.</p>
<p>Memory is not automatically cleared after use by most software and it is not generally cleared within Node.js during an application's lifetime when memory is freed from internal use. This means that a call to <code>malloc()</code> (system memory allocation) usually returns a block of memory that contains data stored by the previous user of that block who <code>free()</code>d it without clearing it. This can cause problems if an attacker can find a way to create these blocks and inspect their contents as secrets usually pass through memory—passwords, credit card numbers, etc. Allocate enough blocks of uncleared memory and you're bound to find something interesting.</p>
<p>In the browser, you have no way to allocate uninitialized memory, so a malicious site can't inspect your memory to find sensitive data arising from your interactions with another site. <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer"><code>ArrayBuffer</code></a> and the various <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray"><code>TypedArray</code></a> types will only ever give you initialized, or zeroed memory—memory that contains only <code>0</code>s.</p>
<p>Historically, for the sake of performance, Node.js has acted more like a traditional un-sandboxed server-side runtime that doesn't need the same kinds of protections as browsers. Unfortunately, many JavaScript programmers are not as attuned to the risks of using uninitialized memory. Additionally, the <code>Buffer</code> constructor itself has some usability flaws that have led to many expert programmers exposing uninitialized memory to potential attackers. <a href="https://github.com/websockets/ws">ws</a>, the very popular WebSocket library, authored by skilled programmers, <a href="https://github.com/websockets/ws/releases/tag/1.0.1">famously exposed uninitialized memory</a> to client connections over the network by means of a simple remote <code>ping()</code> call that passed an integer instead of a string.</p>
<p>The usability concerns around <code>Buffer</code> led to the deprecation of the <code>Buffer()</code> constructor and introduction of new factory methods: <a href="https://nodejs.org/api/buffer.html#buffer_buffer_from_buffer_alloc_and_buffer_allocunsafe"><code>Buffer.from()</code>, <code>Buffer.alloc()</code>, <code>Buffer.allocUnsafe()</code></a>, and the <a href="https://nodejs.org/api/buffer.html#buffer_the_zero_fill_buffers_command_line_option"><code>--zero-fill-buffers</code></a> command line argument. It's worth noting that from version 1.0, <a href="https://nodesource.com/products/nsolid">N|Solid</a>, NodeSource's enterprise Node.js runtime, included a <code>"zeroFillAllocations"</code> option in its <a href="https://docs.nodesource.com/latest/docs#polices">policies</a> feature to address similar concerns.</p>
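<p>The factory methods make the safety trade-off explicit at each call site, as this brief sketch shows:</p>

```javascript
const zeroed = Buffer.alloc(16);      // always zero-filled: safe to expose
const fast = Buffer.allocUnsafe(16);  // fast, but may contain stale memory
const copied = Buffer.from('abc');    // copies its input, never uninitialized

// An allocUnsafe() Buffer must be completely overwritten before it is
// ever sent anywhere outside the process.
fast.fill(0);
```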
<p>Unfortunately, the root cause of the <code>Buffer</code> constructor's usability concerns—too much flexibility in argument types—is still with us, this time in <a href="https://nodejs.org/api/buffer.html#buffer_buf_fill_value_offset_end_encoding"><code>Buffer#fill()</code></a>, whose signature is far too flexible: <code>Buffer#fill(value[, offset[, end]][, encoding])</code>. Internal re-use of this function, and its flexible argument parsing, by <code>Buffer.alloc()</code> exposes a bug that allows a supposedly <em>safe</em> allocation method to return <em>unsafe</em> (i.e. uninitialized) memory blocks.</p>
<p><code>Buffer.alloc()</code> allows a third argument, <code>encoding</code>. When there is a second argument, <code>fill</code>, this and the <code>encoding</code> argument are passed blindly to the internal <code>fill()</code> implementation as second and third arguments. This is where it encounters the familiar <code>Buffer()</code> constructor problem:</p>
<div class="highlight"><pre><span class="kd">function</span> <span class="nx">_fill</span><span class="p">(</span><span class="nx">buf</span><span class="p">,</span> <span class="nx">val</span><span class="p">,</span> <span class="nx">start</span><span class="p">,</span> <span class="nx">end</span><span class="p">,</span> <span class="nx">encoding</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">if</span> <span class="p">(</span><span class="k">typeof</span> <span class="nx">val</span> <span class="o">===</span> <span class="s1">'string'</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="nx">start</span> <span class="o">===</span> <span class="kc">undefined</span> <span class="o">||</span> <span class="k">typeof</span> <span class="nx">start</span> <span class="o">===</span> <span class="s1">'string'</span><span class="p">)</span> <span class="p">{</span>
      <span class="nx">encoding</span> <span class="o">=</span> <span class="nx">start</span><span class="p">;</span>
      <span class="nx">start</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
      <span class="nx">end</span> <span class="o">=</span> <span class="nx">buf</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="k">typeof</span> <span class="nx">end</span> <span class="o">===</span> <span class="s1">'string'</span><span class="p">)</span> <span class="p">{</span>
      <span class="nx">encoding</span> <span class="o">=</span> <span class="nx">end</span><span class="p">;</span>
      <span class="nx">end</span> <span class="o">=</span> <span class="nx">buf</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
</pre></div>
<p>The intention here is that by only passing three arguments, with the third one being <code>encoding</code>, the flexible argument parsing rules would enter the top set of instructions and set <code>encoding = start</code>, <code>start = 0</code>, <code>end = buf.length</code>, precisely what we want for a <code>Buffer</code> fully initialized with the provided <code>val</code>. However, because <code>Buffer.alloc()</code> does minimal type checking of its own, the <code>encoding</code> argument could be a number and this whole block of argument rewriting would be skipped and <code>start</code> could be set to some arbitrary point in the <code>Buffer</code>, even the very end, leaving the whole memory block uninitialized:</p>
<pre><code>&gt; Buffer.alloc(20, 1)
&lt;Buffer 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01&gt;
&gt; Buffer.alloc(20, 'x')
&lt;Buffer 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78&gt;
&gt; Buffer.alloc(20, 1, 20)
&lt;Buffer 80 be 6a 01 01 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00&gt;
// whoops!
</code></pre><p>This is only a security concern if you are allowing unsanitized user input to control the third argument to <code>Buffer.alloc()</code>. Unless you are fully sanitizing and type-checking everything coming in from an external source and know precisely what types are required by your dependencies, you should not assume that you are not exposed.</p>
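<p>One defensive measure is to make sure attacker-influenced values can never reach the flexible argument parser at all. A hypothetical sketch (<code>strictAlloc</code> is not a real Node.js API, just an illustration of the type-checking discipline):</p>

```javascript
// Hypothetical defensive wrapper: refuse to forward a non-string
// encoding into Buffer.alloc(), so flexible argument parsing can never
// be steered by attacker-controlled input.
function strictAlloc(size, fill, encoding) {
  if (encoding !== undefined && typeof encoding !== 'string') {
    throw new TypeError('encoding must be a string');
  }
  return Buffer.alloc(size, fill, encoding);
}
```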
<p>The <a href="https://github.com/nodejs/node/commit/40a7beeddac9b9ec9ef5b49157daaf8470648b08">fix</a> for CVE-2018-7166 simply involves being explicit with internal arguments passed from <code>alloc()</code> to <code>fill()</code> and bypassing the argument shifting code entirely. Avoiding argument cleverness is a good rule to adopt in any case for robustness and security.</p>
<h2 id="out-of-bounds-oob-write-in-buffer-cve-2018-12115-">Out of bounds (OOB) write in <code>Buffer</code> (CVE-2018-12115)</h2>
<p>All actively supported release lines of Node.js are impacted by this flaw.</p>
<p>Node.js TSC member Сковорода Никита Андреевич (Nikita Skovoroda / <a href="https://github.com/chalker">@ChALkeR</a>) discovered an OOB write in <code>Buffer</code> that can be used to write to memory outside of a <code>Buffer</code>'s memory space. This can corrupt unrelated <code>Buffer</code> objects or cause the Node.js process to crash.</p>
<p><code>Buffer</code> objects expose areas of raw memory in JavaScript. Under the hood, this is done in different ways depending on how the <code>Buffer</code> is created and how big it needs to be. For <code>Buffer</code>s less than 8k bytes in length created via <code>Buffer.allocUnsafe()</code> and from most uses of <code>Buffer.from()</code>, this memory is allocated from a pool. This pool is made up of areas of block-allocated memory larger than an individual <code>Buffer</code>. So <code>Buffer</code>s created sequentially will often occupy adjoining memory space. In other cases, memory space may sit adjacent with some other important area of memory used by the current application—likely an internal part of V8 which makes heaviest use of memory in a typical Node.js application.</p>
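<p>The pooling behaviour can be observed directly: the backing <code>ArrayBuffer</code> of a small <code>Buffer.allocUnsafe()</code> allocation is the whole shared pool, while a large allocation gets its own backing store. A sketch, assuming the default <code>Buffer.poolSize</code> of 8 KiB has not been changed:</p>

```javascript
// Small allocUnsafe() Buffers are carved out of a shared pre-allocated
// pool; allocations larger than half the pool get their own memory.
const small = Buffer.allocUnsafe(16);
const big = Buffer.allocUnsafe(9000);

const smallIsPooled = small.buffer.byteLength === Buffer.poolSize; // whole pool, not 16
const bigIsNotPooled = big.buffer.byteLength === 9000;             // exactly-sized
```

This is why sequentially created small <code>Buffer</code>s often sit in adjoining memory, which is exactly what makes an out-of-bounds write on one of them dangerous to its neighbours.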
<p>CVE-2018-12115 centers on <code>Buffer#write()</code> when working with UCS-2 encoding (recognized by Node.js under the names <code>'ucs2'</code>, <code>'ucs-2'</code>, <code>'utf16le'</code> and <code>'utf-16le'</code>) and takes advantage of its two-bytes-per-character arrangement.</p>
<p>Exploiting this flaw involves confusing the UCS-2 string encoding utility in Node.js by telling it you wish to write new contents in the second-to-last position of the current <code>Buffer</code>. Since one byte is not enough for a single UCS-2 character, it should be rejected without changing the target <code>Buffer</code>, just like any <code>write()</code> with zero bytes is. The UCS-2 string encoding utility is written with the assumption that it has at least one whole character to write, but by breaking this assumption we end up setting the "maximum number of characters to write" to <code>-1</code>, which, when passed to V8 to perform the <a href="https://v8docs.nodesource.com/node-10.6/d2/db3/classv8_1_1_string.html#a79d9a617e12421ae3afb7a2060eb6fe4">write</a>, is interpreted as "all of the buffer you provided".</p>
<p>UCS-2 encoding can therefore be tricked into writing as many bytes as you want from the second-to-last position of a <code>Buffer</code> on to the next area of memory. This memory space may be occupied by another <code>Buffer</code> in the application, or by some other semi-random memory space within the application, corrupting state and potentially causing an immediate segmentation fault crash. At best this can be used for a denial of service by forcing a crash. At worst, it could be used to overwrite sensitive data to trick an application into unintended behavior.</p>
<p>As with CVE-2018-7166, exploiting this flaw requires the passing of unsanitized data through to <code>Buffer#write()</code>, possibly in both the data to be written and the position for writing. Unfortunately, this is not an easy scenario to recognize and such code has been found to exist in npm packages available today.</p>
<p>The <a href="https://github.com/nodejs/node/commit/88105c998ef9d3f54aa8f22b82ec8cc31cbfac95">fix</a> for CVE-2018-12115 involves checking for this underflow and bailing early when there really are no full UCS-2 characters to write.</p>
<h1>The Truth About Rod Vagg</h1>
<p><em>Published 2017-08-25 at <a href="https://r.va.gg/2017/08/the-truth-about-rod-vagg.html">https://r.va.gg/2017/08/the-truth-about-rod-vagg.html</a></em></p>
<p><em>NOTE: This post is copied from <a href="https://github.com/nodejs/CTC/issues/165#issuecomment-324798494">https://github.com/nodejs/CTC/issues/165#issuecomment-324798494</a> and the primary intended audience was the Node.js CTC.</em></p>
<hr>
<p><em>Dear reader from <code>${externalSource}</code>: I neither like nor support personal abuse or attacks. If you are showing up here getting angry at any party involved, I would ask you to refrain from targeting them, privately or in public. Specifically to people who think they may be supporting me by engaging in abusive behaviour: I do not appreciate, want or need it, in any form and it is not helpful in any way.</em></p>
<p>Yep, this is a long post, but no apologies for the length this time. Buckle up.</p>
<p>I'm sad that we have reached this point, and that the CTC is being asked to make such a difficult decision. One of the reasons that we initially split the TSC into two groups was to insulate the technical <em>doers</em> on the CTC from the overhead of administrative and political tedium. I know many of you never imagined you'd have to deal with something like this when you agreed to join and that this is a very uncomfortable experience for you.</p>
<p>It's obvious that we never figured out a suitable structure that made the TSC a useful, functional, and healthy body that might be able to deal more effectively with these kinds of problems, more isolated from the CTC. I'm willing to accept a sizeable share of the blame for not improving our organisational structure during my tenure in leadership.</p>
<h2 id="my-response">My response</h2>
<p>Regarding the request for me to resign from the CTC: absent clear justification that my removal is for the benefit of the Node.js project, or a case for my removal that is not built primarily on hearsay and innuendo, I respectfully decline.</p>
<p>There are two primary reasons for which I am standing my ground.</p>
<p>I cannot, in good conscience, give credence to the straw-man version of me being touted loudly on social media and on GitHub. This caricature of me, with its vague notions regarding my "toxicity", my propensity for "harassment", the "systematic" breaking of rules and other slanderous claims against my character, has no basis in fact. I will not dignify these attacks by taking tacit responsibility through voluntary resignation.</p>
<p>Secondly, and arguably more importantly for the CTC: I absolutely will not take responsibility for the precedent that is currently being set. The dogged pursuit of a leader of this project, the strong-arm tactics being deployed with the goal of having me voluntarily resign, and my eventual removal from this organisation are not the behaviour of a healthy, productive, or inclusive community.</p>
<p>My primary concern is that the consequences of these actions endanger the future health of the Node.js project. I do not believe that I am an irreplaceable snowflake (I’m entirely replaceable), but there is reason to pause before making this an acceptable part of how we conduct our governance and our internal relationships.</p>
<p>However, while I am not happy that the burden of this decision has been foisted upon all of you, I am content to stand and be judged by this group. You are the creative force behind Node.js and the legitimate owners of this project; my respect for you as individuals and as a group, and for your rightful position as final arbiters of the technical Node.js project, makes me entirely comfortable living with whatever decision you arrive at regarding my removal.</p>
<p>I will break the rest of this post into the following sections: </p>
<ul>
<li>My critique of the process so far</li>
<li>My response to list of complaints made against me <a href="https://github.com/nodejs/TSC/issues/310">to the TSC</a></li>
<li>Addressing the claims, often repeated across the internet, that I am a hindrance to progress on inclusivity and diversity</li>
<li>The independence of the technical group, the new threats posed to that independence</li>
<li>The threats posed to future leadership of the project</li>
</ul>
<h3 id="the-process-so-far">The process so far</h3>
<p>My personal experience so far has been approximately as follows:</p>
<ul>
<li>Some time ago I received notification via email that there were complaints against me. No details were provided and I was informed that I would neither receive those details nor be involved in whatever process was to take place. Further, TSC members were not allowed to speak to me directly about these matters, <em>including</em> my work colleagues also on the TSC. I was never provided with an opportunity to understand the specific charges against me or be involved in any discussions on this topic from that point onward.</li>
<li>3 days ago, I saw <a href="https://github.com/nodejs/TSC/issues/310">nodejs/TSC#310</a> at the same time as the public. <strong>This was the first time that I had seen the list of complaints</strong>. It was the first that I heard that there was a vote taking place regarding my position.</li>
<li>At no point have I been provided with an opportunity to answer to these complaints, correct the factual errors contained in them (see below), apologise and make amends where possible, or provide additional context that may further explain accusations against me.</li>
<li>At no point have I been approached by a member of the TSC or CTC regarding any of these items beyond what the record we have here on GitHub shows—primarily in the threads involved and in the moderation repository. That record is open for you to view, including whatever due diligence was undertaken either by my accusers or by those executing the process. I have had interactions with only a single member of the TSC regarding one of these matters, in private email and in person, which on both occasions involved me attempting to coax out the source of bad feelings that I had sensed and attempting to (relatively blindly) make amends.</li>
</ul>
<p>I hope you can appreciate that, to me, this process seems rather unfair. Regardless of whether it is informed or dictated by our governance documents, as has been claimed, it should be changed so that in future accused parties have at least the chance to respond to accusations.</p>
<h3 id="response-to-the-list-of-complaints">Response to the list of complaints</h3>
<p>I am including the text that was redacted from <a href="https://github.com/nodejs/TSC/issues/310">nodejs/TSC#310</a> as it is already in the public domain, on social media, also on GitHub and now in the press. Please note that I did not ask for this text to be redacted.</p>
<h4 id="1-">1.</h4>
<blockquote>
<p>In [link to moderation repository discussion, not copied here out of respect for additional parties involved], Rod’s first action was to apologize to a contributor who had been repeatedly moderated. Rod did not discuss the issue with other members of the CTC/TSC first. The result undermined the moderation process as it was occurring. It also undercut the authority as moderators of other CTC/TSC members.</p>
</blockquote>
<p>Rather than delving into the details of this complaint, I will simply say that I was unaware at the time that the actions I had taken were inappropriate and had caused hurt to some CTC/TSC members involved in this matter. Having had this belatedly explained to me (again, something I have had to coax out, not offered freely to me), I issued a private statement to the TSC and CTC via email at the beginning of this month offering my sincere apologies. (I did this without knowing whether it was part of the list of complaints against me.) The most relevant part of my private statement is this:</p>
<blockquote>
<p>In relation to my behaviour in the specific: I should not have weighed in so heavily, or at all, in this instance as I lacked so much of the context of what was obviously a very sensitive matter that was being already dealt with by some of you (in a very taxing way, as I understand it). I missed those signals entirely and weighed in without tact, took sides against some of you—apologising to [unnecessary details withheld] on behalf of some of you was an absurd thing for me to do without having being properly involved prior to this. And for this I unreservedly apologise!</p>
</blockquote>
<p>I don't know if this apology was acknowledged during the process of dealing with the complaints against me. It has not been acknowledged in the publication of the complaints-handling process, nor does it seem to have had any impact on the parties involved, who continue to hold the matter against me. I can only assume that they either dismiss my sincerity or do not consider apologies a sufficient means of rectifying these kinds of missteps.</p>
<p>In this matter I accept responsibility and have already attempted to make amends and prevent a similar issue from recurring. It disappoints me that it is still used as an active smear against me. Again, had I been given clear feedback regarding my misstep earlier, I would have attempted to resolve this situation sooner. </p>
<h4 id="2-">2.</h4>
<blockquote>
<p>In nodejs/board#58 and nodejs/moderation#82 Rod did not moderate himself when asked by another foundation director and told them he would take it to the board. He also ignored the explicit requests to not name member companies and later did not moderate the names out of his comments when requested. Another TSC member needed to follow up later to actually clean up the comments. Additionally he discussed private information from the moderation repo in the public thread, which is explicitly against the moderation policy.</p>
</blockquote>
<p>My response to this complaint is as follows:</p>
<ol>
<li>This thread unfortunately involves a significant amount of background corporate politics, personal relationship difficulties and other matters which conspired to raise the temperature, for me at least. This is not an excuse, simply an explanation for what may have appeared to some to be a heated interjection on my part.</li>
<li>I <em>did</em> edit my post very soon after—I was the first to edit my posts in there after the quick discussion that followed in the moderation repository and I realised I had made a poor judgement call with my choice of words. I both removed my reading of intent into the words of another poster and removed the disclosure of matters discussed in a private forum.</li>
<li>I do not recall being asked to <em>remove</em> the names of the companies involved; I have only now seen that they have been edited out of my post. I cannot find any evidence that such a request was even made. This would have been a trivial matter and I would have complied without argument had I seen such a request. Without additional evidence, it is rather troubling to find this forming the basis of a complaint.</li>
<li>A board member asking another board member (me) to edit their postings seemed to me to be a board matter, hence my suggestion to take it to the board. I was subsequently corrected on this—as it is a TSC-owned repository it was therefore referred to the TSC for adjudication.</li>
</ol>
<p>I considered the remaining specifics of this issue to have been resolved and have not been informed otherwise since this event took place. Yet I now find that the matters are still active and I am the target of criticism rather than that criticism being aimed at the processes that apparently resolved the matter in the first place. Why was I never informed that my part in the resolution was unsatisfactory and why was I not provided a chance to rectify additional perceived misdeeds?</p>
<h4 id="3-">3.</h4>
<blockquote>
<p>Most recently Rod tweeted in support of an inflammatory anti-Code-of-Conduct article. As a perceived leader in the project, it can be difficult for outsiders to separate Rod’s opinions from that of the project. Knowing the space he is participating in and the values of our community, Rod should have predicted the kind of response this tweet received. <a href="https://twitter.com/rvagg/status/887652116524707841">https://twitter.com/rvagg/status/887652116524707841</a></p>
<p>His tweeting of screen captures of immature responses suggests pleasure at having upset members of the JavaScript community and others. As a perceived leader, such behavior reflects poorly on the project. <a href="https://twitter.com/rvagg/status/887790865766268928">https://twitter.com/rvagg/status/887790865766268928</a></p>
<p>Rod’s public comments on these sorts of issues is a reason for some to avoid project participation. <a href="https://twitter.com/captainsafia/status/887782785221615618">https://twitter.com/captainsafia/status/887782785221615618</a></p>
<p>It is evidence to others that Node.js may not be serious about its commitment to community and inclusivity. <a href="https://twitter.com/nodebotanist/status/887724138516951049">https://twitter.com/nodebotanist/status/887724138516951049</a></p>
</blockquote>
<ol>
<li>The post I linked to was absolutely <strong>not</strong> an anti-Code-of-Conduct article. It was an article written by an Associate Professor of Evolutionary Psychology at the University of New Mexico, discussing free speech in general and suggesting a case against speech codes <em>in American university campuses</em>. In sharing this, I hoped to encourage meaningful discussion regarding the possible shortcomings of some standard Code of Conduct language. My intent was not to suggest that the Node.js project should not have a Code of Conduct in place.</li>
<li>"Rod should have predicted the kind of response this tweet received" is a deeply normative statement. I did not predict the storm generated, and assumed that open discussion on matters of speech policing was still possible, and that my personal views would not be misconstrued as the views of the broader Node.js leadership group or community. I obviously chose the wrong forum. If TSC/CTC members are going to be held responsible for attempting to share or discuss personal views on personal channels, then that level of accountability should be applied equally across technical and Foundation leadership.</li>
<li>"His tweeting of screen captures of immature responses suggests pleasure" is an assumption about my feelings at the time. I find this especially ironic in the context of complaint number 2 (above): I was criticised for reading the intention of another individual into their words, yet that is precisely what is being done here. This claim is absolutely untrue; I do not take pleasure in upsetting people. I will refrain from justifying my actions further on this matter, but this accusation is baseless and disingenuous.</li>
<li>To re-state for further clarity, I <strong>have not made a case against Codes of Conduct in general</strong>, but rather, would like to see ongoing discussion about how such social guidelines could be improved upon, as they clearly have impact on open source project health.</li>
<li>I have never made a case against the Node.js Code of Conduct.</li>
<li>I have a clear voting record for adopting the Node.js project's Code of Conduct and for various changes made to it. Codes of Conduct have been adopted by a number of my own projects which have been moved from my own GitHub account to that of the Node.js Foundation.</li>
</ol>
<p>I will refrain from further justifying a tweet. As with all of you, I bring my own set of opinions and values to our diverse mix and we work to find an acceptable common space for us all to operate within. I don’t ask that you agree with me, but within reason I hope that mutual respect is stronger than a single disagreement. I cannot accept that my opinions on these matters form a valid reason for my removal. I have submitted myself to our Code of Conduct as a participant in this project. I have been involved in the application of our Code of Conduct. But I do not accept it as a sacred text that is above critique or even <em>discussion</em>.</p>
<p>While not a matter for the TSC or CTC, there is a Board member of the Foundation who, by their own admission, has repeatedly discussed sensitive and private Board matters publicly on Twitter, causing ongoing consternation and legal concern for the Board. As far as I know, this individual has not been asked to resign. I consider this type of behaviour to be considerably more problematic for the Foundation than my tweeting of a link to an article completely unrelated to Node.js.</p>
<p>Taking action against me on the basis of this tweet, while ignoring the many tweets and other social media posts that stand in direct conflict to the goals of the Foundation by other members of our technical team, its leadership and other members of the Foundation and its various bodies, strikes me as a deeply unequal (and, it must be said, un-inclusive) application of the rules. </p>
<p>If it is the case that the TSC/CTC is setting limits on personal discussion held outside the context of the project repo, then these limits should be applied to all members of both groups without prejudice.</p>
<h4 id="board-accusations">Board accusations</h4>
<p>In addition to the above list, we now have <a href="https://github.com/nodejs/board/issues/67">new claims</a> from the Node.js Foundation board. It appears to suggest that I have engaged, or continue to engage, in <em>“antagonistic, aggressive or derogatory behavior”</em>, with no supporting evidence provided. Presumably the supporting evidence is the list in <a href="https://github.com/nodejs/TSC/issues/310">nodejs/TSC#310</a>, to which I have responded above.</p>
<p>I can’t respond to an unsupported claim such as this. It is presented entirely without merit, and I cannot consider it anything other than malicious, self-serving, and an obvious attempt to emotionally manipulate the TSC and CTC by charging the existing claims with a completely new level of seriousness through the sprinkling of an assortment of stigmatic <em>evil person</em> descriptors.</p>
<p>To say that I am disappointed that a majority of the Board would agree to conduct themselves in such an unprofessional and immature manner is an understatement. However, this is neither the time nor the place for me to address their attempts to smear, defame and <em>unperson</em> me. After requesting of me directly that I “fall on my sword” and not receiving the answer it wanted, the Board has chosen to make it clear where it collectively thinks the high moral ground lies in this matter. As I have already expressed to them, I believe they have made a poor assessment of the facts, have not made the correct choice in their moral stance, and have now stood by and encouraged additional smears against me.</p>
<p>I will have more to say on the Board’s role and our relationship to it below, however.</p>
<h3 id="that-i-am-a-barrier-to-inclusivity-efforts">That I am a barrier to inclusivity efforts</h3>
<p>This is a refrain that is often repeated on social media about me and it's never been made clear, to me at least, how this is justified.</p>
<p>By most objective measures, the Node.js project has been healthier and more open to outsiders during my 2-year tenure in leadership than at any time in its history. One of the great pleasures I've had during this time has been showing and celebrating this on the conference circuit. We have record numbers of contributors: in total, per month, and unique per month. Our issue tracker is so busy with activity that very few of us can stay subscribed to the firehose any more. We span the globe to the point that our core and working group meetings are very difficult to schedule and usually end up leaving people out. We regularly have to work to overcome language and cultural barriers as we continue to expand.</p>
<p>When I survey the contributor base, the collaborator list, the CTC membership, I see true diversity across many dimensions. Claims that I am a barrier to inclusivity and the building of a diverse contributor base are at odds with the prominent role I've had in the project during its explosive growth.</p>
<p>My assessment of the claim that I am a hindrance to inclusivity efforts is that it hinges on the singular matter of moderation and control of the discourse that occurs amongst the technical team. From the beginning I have strongly maintained that the technical team should retain authority over its own space, and that its independence includes the ability to enforce the rules of social interaction and discussion as it sees fit. This has led to disagreements with individuals who would rather insert external arbiters into the moderation process; arbiters who have not earned the right to stand in judgement of technical team members, and who have not been held to the same standards by which technical team members are judged to earn their place in the project.</p>
<p>On this matter I remain staunchly opposed to the dilution of independence of the technical team and will continue to advocate for its ability to make such critical decisions for itself. This is not only a question of moral (earned) authority, but of the risk of subversion of our organisational structures by individuals who are attracted to the project by the possibility of pursuing a personal agenda, regardless of the impact this has on the project itself. I see current moves in this direction, as in this week’s moderation policy proposal at <a href="https://github.com/nodejs/TSC/pull/276">nodejs/TSC#276</a>, as presenting such a risk. I don't expect everyone to agree with me on this, but I have just as much right as everyone else to make my case and not be vilified in my attempts to convince enough of the TSC to prevent such changes.</p>
<p>Further, regarding the other smears against my character that now circulate regularly on social media and GitHub: I would ask that, if you are using any of these as the basis of your judgement of me, you request supporting evidence from those making or repeating such smears. It's been an educational experience to watch a caricatured narrative about my character grow into the monster that it is today, and it saddens me when people I respect take this narrative at face value without bothering to scratch the surface to see if there is any basis in fact.</p>
<p>The use of language such as “systematic” and “pattern” to avoid having to outline specifics should be seen for what it is: a baseless smear. I have a large body of text involving many hundreds of social interactions scattered through the Node.js project and its various repositories on GitHub. If any such “systematic” behavioural problems existed, it should not be difficult to provide clear documentation of them.</p>
<h3 id="threats-to-the-independence-of-the-technical-group">Threats to the independence of the technical group</h3>
<p>We now face the unprecedented move by the Node.js Foundation Board to <a href="https://github.com/nodejs/board/issues/67">inject itself</a> directly into our decision-making process. The message being: the TSC voted the wrong way, and it should vote again until it produces the “right” outcome.</p>
<p>This echoes the sentiment being <a href="https://github.com/nodejs/community-committee/issues/111">expressed in the Community Committee</a> and elsewhere: since there were accusations, there must be guilt, and the fault lies in the inability of the TSC to deal with that guilt. No credence is paid to the possibility that <em>perhaps</em> the TSC evaluated the facts and reached a consensus that no further action was necessary.</p>
<p>I have some sympathy for the position of the Node.js Foundation board. These are tough times in the Silicon Valley environment, particularly with the existing concerns surrounding diversity, inclusivity, and tolerance. I can understand how <em>rumors</em> of similarly unacceptable behavior can pose a threat, even absent any evidence of such behavior. That said, I do not believe that it is in the long-term interests of Node.js or its Foundation to pander to angry mobs, as they represent a small fraction of our stakeholders and their demands are rarely rational. In this case, I believe that a majority of outsiders will be viewing this situation with bemusement at best. It saddens me that there is no recognition of the fact that appeasing angry and unverified demands by activists only leads to greater demands and less logical discussion of these issues. If we accept this precedent then we place the future health of this project in jeopardy, as we will have demonstrated that we allow outsiders to adjust our course to suit personal or private agendas, as long as they can concoct a story to create outrage and dispense mob justice without reproach.</p>
<p>While difficult, I believe that it is important for the technical team to continue to assert its independence, to the board and to other outside influences. We are not children who need adult supervision; treating us as such undermines so much of what we have built over these last few years and erodes the feelings of ownership of the project that we have instilled in our team of collaborators.</p>
<h3 id="the-threat-to-future-leadership-of-the-project">The threat to future leadership of the project</h3>
<p>Finally, I want to address a critical issue which has been overlooked but now poses a real problem for our future: how to grow, enable and support leadership in such a difficult environment.</p>
<p>My tenure in leadership easily represents the most difficult years of my life. The challenges I have had to face have forced me to grow in ways I never expected. I'm thankful for the chance to meet these challenges, however, and even though it's taken a toll on my health, I'll be glad to have had the experience when I look back.</p>
<p>One of my tasks as a leader, particularly serving in the role of bridge between the Board and the technical team, has involved maintaining that separation and independence while also shielding the technical team from the intense corporate and personal politics that constantly exist, and are being exercised, within and around the Foundation. This role forced me to take strong positions on many issues and to stand up to pressure applied from many different directions. In doing what I felt was best to support my technical team members I’m sure I’ve put people off-side—that's an unfortunate consequence of good intentions, but not an uncommon one. I wouldn't say I've made enemies so much as had to engage in <em>very</em> difficult conversations and involve myself in the surfacing of many disagreements that are difficult, and sometimes impossible, to resolve.</p>
<p>Having to involve yourself in a wide variety of decision-making processes inevitably requires that you make tough calls or connect yourself in some way to controversial discussions. I'm sure our current leadership can attest to the awkward positions they have found themselves in, and the difficult conversations they have had to navigate, including this one!</p>
<p>I'll never pretend I don't have limitations in the skills, both intellectual and emotional, required to navigate these tough waters. But when I consider the sheer number of dramas, controversies, and difficult conversations I've had to be involved in—and when I consider the thousands of pages of text I have left littered across GitHub and the other forums we use to get things done—I come to this conclusion: if the best reason you can find to force my resignation is the above list of infractions, <strong>given the weight of content you could dredge through, then you're either not trying very hard or I should be pretty proud of myself for keeping a more level head than I had imagined</strong>.</p>
<p>That aside, my greatest concern for the future of leadership, arising from the actions currently being pursued, is that we've painted ourselves into a corner regarding the leaders we're going to have available. The message that the Board has chosen to send today can rightly be interpreted as this: if the mob comes calling, if the narrative of evil is strong enough, then regardless of the objective facts, the Foundation does not have your back. As developers and leaders, we are being signalled that the Foundation will not stand up for us when things get tough. Combine this with a difficult and thankless job, where the result of exercising your duties could be career-killing, and the likely outcome is that leadership will be left to:</p>
<ul>
<li>Individuals who are comfortable giving in to the whims of the outside activists, whatever the demands, slowly transforming this project into something entirely different and focused on matters not associated with making Node.js a success</li>
<li>Individuals who are capable but shrewd enough to avoid responsibility</li>
<li>Individuals who are capable and take on responsibility, exercise backbone when standing against pressure groups and mob tactics but get taken down because the support structures either abandon them or turn against them</li>
</ul>
<p>This kind of pattern is already evident across the professionalised open source sphere, and Node.js is about to set a new low bar. Do not be surprised as quality leaders become more difficult to find, or become unconvinced that the exercise of leadership duties is in their personal interest at all.</p>
<p>This is a great challenge for modern open source and I'm so sad that I am being forced to be involved in the setting of our current trajectory. I hope we can find space in the future to have the necessary dialog to find a way out of the hole being dug.</p>
<h3 id="in-summary">In summary</h3>
<p>Obviously I hope that you agree that (a) this action against me is unwarranted, is based on flawed and/or irrelevant claims of “misbehaviour” and is based in malicious intent, and that (b) allowing this course of action to be an acceptable part of our governance procedures will have detrimental consequences for the future health of the project.</p>
<p>I ask the CTC to reject this motion, for the TSC to reject the demand by the Board for my suspension, and that we as a technical team send a signal that our independence is critical to the success of the project, despite the accusations of an angry mob.</p>
<p>Thank you if you dignified my words by reading this far!</p>
<h2 id="why-i-dont-use-nodes-core-stream-module">Why I don't use Node's core 'stream' module</h2>
<p><em>Published 2014-06-14, <a href="https://r.va.gg/2014/06/why-i-dont-use-nodes-core-stream-module.html">https://r.va.gg/2014/06/why-i-dont-use-nodes-core-stream-module.html</a></em></p>
<p><em>This article was originally offered to nearForm for publishing and appeared for some time on their blog from early 2014 (at this URL: <a href="http://www.nearform.com/nodecrunch/dont-use-nodes-core-stream-module">http://www.nearform.com/nodecrunch/dont-use-nodes-core-stream-module</a>). It has since been deleted. I'd rather not speculate about the reasons for the deletion but I believe the article contains a very important core message so I'm now republishing it here.</em></p>
<h2 id="tl-dr">TL;DR</h2>
<p>The "readable-stream" package available in npm is a mirror of the Streams2 and Streams3 implementations in Node-core. You can guarantee a stable streams base, regardless of what version of Node you are using, if you only use "readable-stream".</p>
<h2 id="the-good-ol-days">The good 'ol days</h2>
<p>Prior to Node 0.10, implementing a stream meant extending the core <code>Stream</code> object. This object was simply an <code>EventEmitter</code> that added a special <code>pipe()</code> method to do the streaming magic.</p>
<p>Implementing a stream usually started with something like this:</p>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">Stream</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'stream'</span><span class="p">).</span><span class="nx">Stream</span>
<span class="kd">var</span> <span class="nx">util</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'util'</span><span class="p">)</span>
<span class="kd">function</span> <span class="nx">MyStream</span> <span class="p">()</span> <span class="p">{</span>
<span class="nx">Stream</span><span class="p">.</span><span class="nx">call</span><span class="p">(</span><span class="k">this</span><span class="p">)</span>
<span class="p">}</span>
<span class="nx">util</span><span class="p">.</span><span class="nx">inherits</span><span class="p">(</span><span class="nx">MyStream</span><span class="p">,</span> <span class="nx">Stream</span><span class="p">)</span>
<span class="c1">// stream logic, implemented however you want</span>
</pre></div>
<p>If you ever had to write a non-trivial stream implementation for pre-0.10 Node without using a helper library (such as <a href="https://github.com/dominictarr/through">through</a>), you know what a nightmare the state-management can be. The actual implementation of a custom stream involves a lot more than just the above code.</p>
<h2 id="welcome-to-node-0-10">Welcome to Node 0.10</h2>
<p>Thankfully, Streams2 came along with a brand new set of base Stream implementations that do a whole lot more than <code>pipe()</code>. The biggest win for stream implementers comes from the fact that state-management is almost entirely taken care of for you. You simply need to provide concrete implementations of some abstract methods to make a fully functional stream, even for non-trivial workloads.</p>
<p>Implementing a stream now looks something like this:</p>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">Readable</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'stream'</span><span class="p">).</span><span class="nx">Readable</span>
<span class="c1">// `Stream` is still provided for backward-compatibility</span>
<span class="c1">// Use `Writable`, `Duplex` and `Transform` where required</span>
<span class="kd">var</span> <span class="nx">util</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'util'</span><span class="p">)</span>
<span class="kd">function</span> <span class="nx">MyStream</span> <span class="p">()</span> <span class="p">{</span>
<span class="nx">Readable</span><span class="p">.</span><span class="nx">call</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="p">{</span> <span class="cm">/* options, maybe `objectMode:true` */</span> <span class="p">})</span>
<span class="p">}</span>
<span class="nx">util</span><span class="p">.</span><span class="nx">inherits</span><span class="p">(</span><span class="nx">MyStream</span><span class="p">,</span> <span class="nx">Readable</span><span class="p">)</span>
<span class="c1">// stream logic, implemented mainly by providing concrete method implementations:</span>
<span class="nx">MyStream</span><span class="p">.</span><span class="nx">prototype</span><span class="p">.</span><span class="nx">_read</span> <span class="o">=</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">size</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ... </span>
<span class="p">}</span>
</pre></div>
<p>State-management is handled by the base-object and you interact with internal methods, such as <code>this.push(chunk)</code> in the case of a <code>Readable</code> stream.</p>
<p>While the internal streams implementations are an order-of-magnitude more complex than the previous core-streams implementation, most of it is there to make life an order-of-magnitude easier for those of us implementing custom streams. Yay!</p>
<h2 id="backward-compatibility">Backward-compatibility</h2>
<p>When every new major stable release of Node occurs, anyone releasing public packages in npm has to make a decision about which versions of Node they support. As a general rule, the authors of the most popular packages in npm will support the current stable version of Node and the previous stable release.</p>
<p>Streams2 was designed with backwards-compatibility in mind. Streams using <code>require('stream').Stream</code> as a base will still mostly work as you'd expect and they will also work when piped to streams that extend the other classes. Streams2 streams won't work like classic EventEmitter objects when you pipe them together, as old-style streams do. But when you pipe a Streams2 stream and an old-style EventEmitter-based stream together, Streams2 will fall-back to "compatibility-mode" and operate in a backward-compatible way.</p>
<p>So Streams2 are great and mostly backward-compatible (aside from some tricky edge cases). But what about when you want to implement Streams2 and run on Node 0.8? And what about open source packages in npm that want to still offer Node 0.8 compatibility while embracing the new Streams2-goodness?</p>
<h3 id="-readable-stream-to-the-rescue">"readable-stream" to the rescue</h3>
<p>During the 0.9 development phase, prior to the 0.10 release, Isaac developed the new Streams2 implementation in a package that was released in npm and usable on older versions of Node. The <a href="https://github.com/isaacs/readable-stream">readable-stream</a> package is essentially a mirror of the streams implementation of Node-core but is available in npm. This is a pattern we will hopefully be seeing more of as we march towards Node 1.0. Already there is a <a href="https://github.com/isaacs/core-util-is">core-util-is</a> package that makes available the shiny new <code>is</code> type-checking functions in the 0.11 core 'util' package.</p>
<p><strong>readable-stream</strong> gives us the ability to use Streams2 on versions of Node that don't even have Streams2 in core. So a common pattern for supporting older versions of Node while still being able to hop on the Streams2-bandwagon starts off something like this, assuming you have "readable-stream" as a dependency:</p>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">Readable</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'stream'</span><span class="p">).</span><span class="nx">Readable</span> <span class="o">||</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'readable-stream'</span><span class="p">).</span><span class="nx">Readable</span>
</pre></div>
<p>This works because there is no <code>Readable</code> object on the core 'stream' package in 0.8 and prior, so if you are running on an older version of Node it skips straight to the "readable-stream" package to get the required implementation.</p>
<h2 id="streams3-a-new-flavour">Streams3: a new flavour</h2>
<p>The <strong>readable-stream</strong> package is still being used to track the changes to streams coming in 0.12. The upcoming Streams3 implementation is more of a tweak than a major change. It contains an attempt to make "compatibility mode" more of a first-class citizen of the API and also some improvements to pause/resume behaviour.</p>
<p>Like Streams2, the aim with Streams3 is for backward (and forward) compatibility but there are limits to what can be achieved on this front.</p>
<p>While this new streams implementation will likely be an improvement over the current Streams2 implementation, it is part of the <em>unstable</em> development branch of Node and is so far not without its edge cases, which can break code designed against the pure 0.10 versions of Streams2.</p>
<h2 id="what-is-your-base-implementation-">What is your base implementation?</h2>
<p>Looking back at the code used to fetch the base Streams2 implementation for building custom streams, let's consider what we're actually getting with different versions of Node:</p>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">Readable</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'stream'</span><span class="p">).</span><span class="nx">Readable</span> <span class="o">||</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'readable-stream'</span><span class="p">).</span><span class="nx">Readable</span>
</pre></div>
<ul>
<li><em>Node 0.8 and prior:</em> we get whatever is provided by the readable-stream package in our dependencies.</li>
<li><em>Node 0.10:</em> we get the particular version of Streams2 that comes with the version of Node we're using.</li>
<li><em>Node 0.11:</em> we get the particular version of Streams3 that comes with the version of Node we're using.</li>
</ul>
<p>This may not be interesting if you have full control over all deployments of your custom stream implementations and which version(s) of Node they will be used on. But it can cause problems for open source libraries distributed via npm, whose users may still be stuck on 0.8 (for some, the upgrade path is not an easy one, for various reasons), on 0.10, or even trying out some of the new Node and V8 features available in 0.11.</p>
<p>What you end up with is a very unstable base upon which to build your streams implementation. This is particularly acute since the vast bulk of the code used to construct the stream logic is coming from either Node-core or the readable-stream package. Any <em>bugs</em> fixed in later Node 0.10 releases will obviously still be present for people still stuck on earlier 0.10 releases even if the readable-stream dependency has the <em>fixed</em> version.</p>
<p>Then, when your streams code is run on Node 0.11, suddenly it's a Streams3 stream which has slightly different behaviour to what most of your users are experiencing.</p>
<p>One of the ways these subtle differences are exposed is in bug reports. Users may report a bug that only occurs on their particular combination of core streams and readable-stream, and it may not be obvious that the problem relates to base-stream implementation edge cases they are stumbling upon, wasting time for everyone.</p>
<p>And what about stability? The fragmentation introduced by all of the possible combinations means that your otherwise stable library is having instability foisted upon it from the outside. This is one of the costs of relying on a featureful standard-library (core) within a rapidly developing, pre-v1 platform. But we can do something about it by taking control of the exact version of the base streams objects we want to extend regardless of what is bundled in the version of Node being used. <strong>readable-stream</strong> to the rescue!</p>
<h2 id="taking-control">Taking control</h2>
<p>To control exactly what code your streams implementation is building on, simply pin the version of readable-stream and use only it, avoiding <code>require('stream')</code> completely. Then you get to make the choice when to upgrade to Streams3, even if that's some time <em>after</em> Node 0.12.</p>
<p><strong>readable-stream</strong> comes in two major versions, <strong>v1.0.x</strong> and <strong>v1.1.x</strong>. The former tracks the Streams2 implementation in Node 0.10, including bug-fixes and minor improvements as they are added. The latter tracks Streams3 as it develops in Node 0.11; we may see a v1.2.x branch for Node 0.12.</p>
<p>Any library worth using should be following the basics of semver minor and patch versions (the merits and finer points of major versioning are still something worth debating). readable-stream gives you proper patch-level versioning so if you pin to <code>"~1.0.0"</code> you'll get the latest Node 0.10 Streams2 implementation, including any fixes and minor non-breaking improvements. The patch-level version of 1.0.x and 1.1.x should mirror the patch-level versions of Node core releases as we proceed.</p>
<p>When you're ready to start using Streams3 you can pin to <code>"~1.1.0"</code>, but you should hold off until much closer to Node 0.12, if not after its formal release.</p>
<h2 id="small-core-ftw-">Small core FTW!</h2>
<p>Being able to control precisely the versions of dependencies your code uses reduces the scope for bugs introduced by version incompatibilities or new and unproven implementations.</p>
<p>When we rely on a bulky standard-library to build our libraries and applications, we're relying on a shifting sand that we have little control over. This is particularly a problem for open source libraries whose users have legitimate (and sometimes not-so-legitimate) reasons for using versions that you'd rather not have to support.</p>
<p>Streams2 is a powerful abstraction, but the implementation is far from simple. The Streams2 code is some of the most complex JavaScript you'll find in Node core. Unless you want to have a detailed understanding of how they work and be able to track the changes as they develop, you should pin your Streams2 dependency in the same way as you pin all your other dependencies. Opt for <strong>readable-stream</strong> over what Node-core offers:</p>
<div class="highlight"><pre><span class="p">{</span>
<span class="nt">"name"</span><span class="p">:</span> <span class="s2">"mystream"</span><span class="p">,</span>
<span class="err">...</span>
<span class="nt">"dependencies"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"readable-stream"</span><span class="p">:</span> <span class="s2">"~1.0.0"</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">Readable</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'readable-stream'</span><span class="p">).</span><span class="nx">Readable</span>
<span class="kd">var</span> <span class="nx">util</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'util'</span><span class="p">)</span>
<span class="kd">function</span> <span class="nx">MyStream</span> <span class="p">()</span> <span class="p">{</span>
<span class="nx">Readable</span><span class="p">.</span><span class="nx">call</span><span class="p">(</span><span class="k">this</span><span class="p">)</span>
<span class="p">}</span>
<span class="nx">util</span><span class="p">.</span><span class="nx">inherits</span><span class="p">(</span><span class="nx">MyStream</span><span class="p">,</span> <span class="nx">Readable</span><span class="p">)</span>
<span class="nx">MyStream</span><span class="p">.</span><span class="nx">prototype</span><span class="p">.</span><span class="nx">_read</span> <span class="o">=</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">size</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ... </span>
<span class="p">}</span>
</pre></div>
<h2 id="addendum-through2-">Addendum: "through2"</h2>
<p>If the boilerplate of the Streams2 base objects ("classes") is too much for you or triggers some past-life Java PTSD, you can just opt for the "through2" package in npm to get the job done.</p>
<p><a href="https://github.com/rvagg/through2">through2</a> is based on Dominic Tarr's <a href="https://github.com/dominictarr/through">through</a> but is built for Streams2, whereas "through" is a pure Streams1 style. The API isn't quite the same but the flexibility and simplicity are.</p>
<p>through2 gives you a <code>DuplexStream</code> as a base to implement any kind of stream you like, be it purely readable, purely writable or fully duplex. In fact, you can even use through2 to implement a <code>PassThrough</code> stream by not providing an implementation!</p>
<p>From the examples:</p>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">through2</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'through2'</span><span class="p">)</span>
<span class="nx">fs</span><span class="p">.</span><span class="nx">createReadStream</span><span class="p">(</span><span class="s1">'ex.txt'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">pipe</span><span class="p">(</span><span class="nx">through2</span><span class="p">(</span><span class="kd">function</span> <span class="p">(</span><span class="nx">chunk</span><span class="p">,</span> <span class="nx">enc</span><span class="p">,</span> <span class="nx">callback</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o"><</span> <span class="nx">chunk</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">chunk</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="o">==</span> <span class="mi">97</span><span class="p">)</span>
<span class="nx">chunk</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">122</span> <span class="c1">// swap 'a' for 'z'</span>
<span class="k">this</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">chunk</span><span class="p">)</span>
<span class="nx">callback</span><span class="p">()</span>
<span class="p">}))</span>
<span class="p">.</span><span class="nx">pipe</span><span class="p">(</span><span class="nx">fs</span><span class="p">.</span><span class="nx">createWriteStream</span><span class="p">(</span><span class="s1">'out.txt'</span><span class="p">))</span>
</pre></div>
<p>Or an object stream:</p>
<div class="highlight"><pre><span class="nx">fs</span><span class="p">.</span><span class="nx">createReadStream</span><span class="p">(</span><span class="s1">'data.csv'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">pipe</span><span class="p">(</span><span class="nx">csv2</span><span class="p">())</span>
<span class="p">.</span><span class="nx">pipe</span><span class="p">(</span><span class="nx">through2</span><span class="p">.</span><span class="nx">obj</span><span class="p">(</span><span class="kd">function</span> <span class="p">(</span><span class="nx">chunk</span><span class="p">,</span> <span class="nx">enc</span><span class="p">,</span> <span class="nx">callback</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">{</span>
<span class="nx">name</span> <span class="o">:</span> <span class="nx">chunk</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="p">,</span> <span class="nx">address</span> <span class="o">:</span> <span class="nx">chunk</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="p">,</span> <span class="nx">phone</span> <span class="o">:</span> <span class="nx">chunk</span><span class="p">[</span><span class="mi">10</span><span class="p">]</span>
<span class="p">}</span>
<span class="k">this</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span>
<span class="nx">callback</span><span class="p">()</span>
<span class="p">}))</span>
<span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="s1">'data'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">all</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span>
<span class="p">})</span>
<span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="s1">'end'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">()</span> <span class="p">{</span>
<span class="nx">doSomethingSpecial</span><span class="p">(</span><span class="nx">all</span><span class="p">)</span>
<span class="p">})</span>
</pre></div>
NodeSchool comes to Australia2014-06-14T00:00:00.000Zhttps://r.va.gg/2014/06/nodeschool-comes-to-australia.html
<p><strong><a href="http://nodeschool.io">NodeSchool</a></strong> has its genesis ultimately at NodeConf 2013 where <a href="https://twitter.com/substack">@substack</a> introduced us to <a href="https://github.com/substack/stream-adventure">stream-adventure</a>. I took the concept home to <a href="http://campjs.com/">CampJS</a> and wrote <strong><a href="https://github.com/rvagg/learnyounode/">learnyounode</a></strong> for the <em>introduction to Node.js</em> workshop I was to run. As part of the process I extracted a package called <a href="https://github.com/rvagg/workshopper">workshopper</a> to do the work of making a terminal workshop experience. Most of the logic originally came from stream-adventure. A short time after CampJS, I created <a href="https://github.com/rvagg/levelmeup">levelmeup</a> for <a href="http://nodeconf.eu/">NodeConf.eu</a> and suddenly we had a selection of what have come to be known as <em>"workshoppers"</em>. At NodeConf.eu, <a href="https://twitter.com/brianloveswords">@brianloveswords</a> suggested the NodeSchool concept and registered the <a href="http://nodeschool.io">domain</a>, @substack provided the artwork and the ball was rolling.</p>
<p>Today, workshopper is depended on by 22 packages in npm, most of which are workshoppers that you can install and use to learn aspects of Node.js or JavaScript in general. The curated list of usable workshoppers is maintained at <a href="http://nodeschool.io">nodeschool.io</a>.</p>
<p>learnyounode itself is now being downloaded at a rate of roughly 200 <em>per day</em>. That's at least 200 more people each day wanting to learn how to <em>do Node.js</em>.</p>
<div style="margin: 0 auto; text-align: center;">
<img src="https://nodei.co/npm-dl/learnyounode.png?months=6" alt="learnyounode downloads">
</div>
<p>I can't recall exactly how <em>"NodeSchool IRL"</em> events started but it was probably <a href="http://twitter.com/maxogden">@maxogden</a> who has been responsible for a large number of these events. There have now been over 30 of these events around the world and the momentum is only increasing. The beauty of this format is that it's low-cost and low-effort to make it happen. All you need is a venue where nerds can show up with their computers and some basic guidance. There have even been a few events without experienced Node.js mentors, but that's no great barrier as the lessons are largely self-guided and work particularly well when pairs or groups of people work together on solutions.</p>
<div style="margin: 0 auto; text-align: center;">
<img src="https://raw.githubusercontent.com/nodeschool/nodeschool.github.io/master/images/nodeschool-hex.png" style="width: 300px;">
</div>
<h2 id="nodeschool-comes-to-australia">NodeSchool comes to Australia</h2>
<p>It's surprising, given all of the NodeSchool activity around the world, that we haven't yet had a single NodeSchool event in Australia. CampJS had learnyounode last year and this year there were <a href="https://github.com/nodeschool/discussions/issues/323">3 brand new workshoppers</a> introduced there, so it's the closest thing we've had.</p>
<p>Next weekend, on the <strong>21st of June</strong>, we are attempting a <strong>coordinated Australian NodeSchool</strong> event. At the moment, that coordination amounts to having events hosted in Sydney, Melbourne and Hobart; unfortunately the timing has been difficult for Brisbane and we haven't managed to bring anyone else out of the woodwork. But we will be attempting to do this regularly, plus we'd like to encourage meet-up hosts to use the format now and again with their groups.</p>
<h2 id="nodeschool-in-sydney">NodeSchool in Sydney</h2>
<p>I'll be at NodeSchool in Sydney next weekend. It will be proudly hosted by <a href="http://nicta.com.au/">NICTA</a> who have a space for up to 60 people. NICTA are currently doing some interesting work with WebRTC; you should catch up with <a href="https://twitter.com/DamonOehlman">@DamonOehlman</a> if this is something you're interested in. <a href="https://www.tabcorp.com.au/">Tabcorp</a> will also be a major sponsor of the event. Tabcorp have been building a new digital team with the back-end almost entirely in Node.js, and they are doing a great job of engaging with and contributing to existing and new open source projects. They are also hiring, so be sure to catch up with <a href="https://twitter.com/romainprieto">@romainprieto</a> if you're doing PHP, Java or some other abomination and want to be doing Node!</p>
<p>Thanks to the sponsorship, we'll be able to provide some catering for the event. Currently we're looking at providing lunch but we may be expanding that to providing some breakfast treats. We'll also be providing refreshments for everyone attending throughout the day.</p>
<p>Start time is 9.30am, end is 4pm. The plan is to spend the first half of the day doing introductory Node.js which will mainly mean working through learnyounode. The second half of the day will be less structured and we'll encourage attendees to work on other workshoppers that they find interesting. Thankfully we have some amazing Node.js programmers in Sydney and they'll be available as mentors.</p>
<p>We are currently <em>selling</em> tickets for $5; the money will contribute towards the event and there is no profit involved. We don't <em>need</em> to charge for the event, but given the generally dismal turnout for tech meet-ups that are free, we feel that providing a small commitment barrier will help us maximise the use of the space we have available. <strong>If the money is a barrier for you please contact us!</strong> We don't want anyone to miss out. Also, we have special "mentor" tickets available for experienced Node.js programmers who are able to assist. If you think you fit into this category please contact us too.</p>
<p>You can <strong>sign up for Sydney NodeSchool at <a href="https://ti.to/nodeschool/sydney-june-2014/">https://ti.to/nodeschool/sydney-june-2014/</a></strong>. If you are tempted, don't sit on the fence because spots are limited and as of writing the tickets are almost 1/2 gone.</p>
<h2 id="nodeschool-in-melbourne">NodeSchool in Melbourne</h2>
<p>NodeSchool in Melbourne is being supported by <a href="http://www.thoughtworks.com/">ThoughtWorks</a> who have been doing Node in Australia for a while now. If you're interested in their services or want to chat about employment opportunities you should catch up with <a href="https://github.com/lfendy">Liauw Fendy</a>.</p>
<p><a href="https://twitter.com/sidorares">@sidorares</a> is putting in a large amount of the legwork for Melbourne's event. He was a major contributor to the original learnyounode and is a huge asset to the Melbourne Node community. Along with Andrey, Melbourne has a large number of expert Node.js hackers, many of whom will be available as mentors. This will be a treat for Melbournians, so this is not something you should miss if you are in town! Potential mentors should contact Andrey.</p>
<p>You can <strong>sign up for Melbourne NodeSchool at <a href="https://ti.to/nodeschool/melbourne-june-2014/">https://ti.to/nodeschool/melbourne-june-2014/</a></strong>.</p>
<h2 id="nodeschool-in-hobart">NodeSchool in Hobart</h2>
<p>Hobart is lucky to have <a href="http://joshgilli.es/">@joshgillies</a>, a local tech-community organiser responsible for many Tasmanian web and JavaScript events. The event is being hosted at <a href="http://typewriterfactory.com/">The Typewriter Factory</a>, a business workspace that Josh helps run. Sponsorship is being provided by <a href="http://www.acs.org.au/">ACS</a> who will be helping support the venue and also provide some catering.</p>
<p>You can <strong>sign up for Hobart NodeSchool at <a href="https://ti.to/nodeschool/hobart-june-2014/">https://ti.to/nodeschool/hobart-june-2014/</a></strong>.</p>
Testing code against many Node versions with Docker2013-11-26T00:00:00.000Zhttps://r.va.gg/2013/11/testing-code-against-many-node-versions-with-docker.html
<p>I haven't found reason to play with <a href="http://www.docker.io">Docker</a> until now, but I've finally come up with an excellent use-case.</p>
<p><a href="https://github.com/rvagg/nan">NAN</a> is a project that helps build native Node.js add-ons while maintaining compatibility with Node and V8 from Node versions 0.8 onwards. V8 is currently undergoing major internal changes which is making add-on development very difficult; NAN's purpose is to abstract that pain. Instead of having to manage the difficulties of keeping your code compatible across Node/V8 versions, NAN does it for you. But this means that we have to be sure to keep NAN tested and compatible with all of the versions it claims to support.</p>
<p><a href="https://travis-ci.org/">Travis</a> can help a little with this. It's possible to use <a href="https://github.com/creationix/nvm">nvm</a> to test across different versions of Node, we've tried this with NAN (see the <a href="https://github.com/rvagg/nan/blob/ba82a9c1fba01b3df553ac624aeaf15ca3688315/.travis.yml">.travis.yml</a>). Ideally you'd have better choice of Node versions, but Travis have had some <a href="https://github.com/travis-ci/travis-ci/issues/1328">difficulty</a> keeping up. Also, npm bugs make it difficult, with a high failure rate from npm install problems, like <a href="https://travis-ci.org/rvagg/nan/jobs/14440485">this</a> and <a href="https://travis-ci.org/rvagg/nan/jobs/14474613">this</a>, so I don't even publish the badge on the NAN README.</p>
<p>The other problem with Travis is that it's a CI solution, not a proper testing solution. Even if it worked well, it's not really that helpful in the development process; you need rapid feedback that your code is working on your target platforms (this is one reason why I love back-end development more than front-end development!).</p>
<p>Enter Docker and <strong><a href="https://github.com/rvagg/dnt">DNT</a></strong></p>
<div style="margin: 0 auto;">
<img src="https://www.docker.com/sites/default/files/legal/small_v.png" width="114" height="114">
<img src="https://nodejs.org/images/logos/nodejs-dark.png" width="212" height="114">
<img src="https://img.pandawhale.com/29490-Picard-applause-clapping-gif-s5nz.gif" width="151" height="114">
</div>
<h3 id="dnt-docker-node-tester">DNT: Docker Node Tester</h3>
<p>Docker is a tool that simplifies the use of Linux containers to create lightweight, isolated compute "instances". Solaris and its variants have had this functionality for years in the form of "zones" but it's a fairly new concept for Linux and Docker makes the whole process a lot more friendly.</p>
<p><strong>DNT</strong> contains two tools that work with Docker and Node.js to set-up containers for testing and run your project's tests in those containers.</p>
<div style="margin: 0 auto;">
<img src="https://r.va.gg/images/2013/11/nan-dnt.png">
</div>
<p><strong>DNT</strong> includes a <code>setup-dnt</code> script that sets up the most basic Docker images required to run Node.js applications, nothing extra. It first creates an image called "dev_base" that uses the default Docker "ubuntu" image and adds the build tools required to compile and install Node.js.</p>
<p>Next it creates a "node_dev" image that contains a complete copy of the Node.js <a href="http://github.com/joyent/node">source repository</a>. Finally, it creates the required series of images: for each Node version, an image with that version of Node installed and ready to use.</p>
<p>Setting up a project is a matter of creating a <em>.dntrc</em> file in the root directory of the project. This configuration file involves setting a <code>NODE_VERSIONS</code> variable with a list of all of the versions of Node you want to test against, and this can include "master" to test the latest code from the Node repository. You also set a <code>TEST_CMD</code> variable with a series of commands required to set up, compile and execute your tests. The <code>setup-dnt</code> command can be run against a <em>.dntrc</em> file to make sure that the appropriate Docker images are ready. The <code>dnt</code> command can then be used to execute the tests against all of the Node versions you specified.</p>
<p>Since Docker containers are completely isolated, <strong>DNT</strong> can run tests in parallel as long as the machine has the resources. The default is to use the number of cores on the computer as the concurrency level but this can be configured if not appropriate.</p>
<p>Currently <strong>DNT</strong> is designed to parse TAP test output by reading the final line as either "ok" or "not ok" to report test status back on the command-line. It is configurable but you need to supply a command that will transform test output to either an "ok" or "not ok" (<code>sed</code> to the rescue?).</p>
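<p>For example, a hypothetical filter for a test runner that prints an "N passing" summary line might look like this (the summary format and the sed expression are illustrative only, not DNT defaults):</p>

```shell
# Map a hypothetical "N passing" summary line to TAP-style "ok";
# anything that doesn't match becomes "not ok"
echo "25 passing" | sed -e 's/^[0-9][0-9]* passing$/ok/' -e 't' -e 's/.*/not ok/'
# → ok
echo "3 failing"  | sed -e 's/^[0-9][0-9]* passing$/ok/' -e 't' -e 's/.*/not ok/'
# → not ok
```

<p>The <code>t</code> command branches past the catch-all substitution when the first one succeeds.</p>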
<h3 id="how-i-m-using-it">How I'm using it</h3>
<p>My primary use-case is for testing <strong>NAN</strong>. The test suite needs a lot of work so being able to test against all the different V8 and Node APIs while coding is super helpful; particularly when tests run so quickly! My NAN <em>.dntrc</em> file tests against master, all of the 0.11 releases since 0.11.4 (0.11.0 to 0.11.3 are explicitly not supported by NAN) and the last 5 releases of the 0.10 and 0.8 series. At the moment that's 17 versions of Node in all and on my computer the test suite takes approximately 20 seconds to complete across all of these releases.</p>
<p><strong>The NAN <a href="https://raw.github.com/rvagg/nan/master/.dntrc">.dntrc</a></strong></p>
<div class="highlight"><pre><span class="nv">NODE_VERSIONS</span><span class="o">=</span><span class="s2">"\</span>
<span class="s2"> master \</span>
<span class="s2"> v0.11.9 \</span>
<span class="s2"> v0.11.8 \</span>
<span class="s2"> v0.11.7 \</span>
<span class="s2"> v0.11.6 \</span>
<span class="s2"> v0.11.5 \</span>
<span class="s2"> v0.11.4 \</span>
<span class="s2"> v0.10.22 \</span>
<span class="s2"> v0.10.21 \</span>
<span class="s2"> v0.10.20 \</span>
<span class="s2"> v0.10.19 \</span>
<span class="s2"> v0.10.18 \</span>
<span class="s2"> v0.8.26 \</span>
<span class="s2"> v0.8.25 \</span>
<span class="s2"> v0.8.24 \</span>
<span class="s2"> v0.8.23 \</span>
<span class="s2"> v0.8.22 \</span>
<span class="s2">"</span>
<span class="nv">OUTPUT_PREFIX</span><span class="o">=</span><span class="s2">"nan-"</span>
<span class="nv">TEST_CMD</span><span class="o">=</span><span class="s2">"\</span>
<span class="s2"> cd /dnt/test/ && \</span>
<span class="s2"> npm install && \</span>
<span class="s2"> node_modules/.bin/node-gyp --nodedir /usr/src/node/ rebuild && \</span>
<span class="s2"> node_modules/.bin/tap js/*-test.js; \</span>
<span class="s2">"</span>
</pre></div>
<p>Next I configured <strong><a href="https://github.com/rvagg/node-leveldown">LevelDOWN</a></strong> for <strong>DNT</strong>. Its needs are much simpler: the tests need to compile the add-on and run a lot of node-tap tests.</p>
<p><strong>The LevelDOWN <a href="https://raw.github.com/rvagg/node-leveldown/master/.dntrc">.dntrc</a></strong></p>
<div class="highlight"><pre><span class="nv">NODE_VERSIONS</span><span class="o">=</span><span class="s2">"\</span>
<span class="s2"> master \</span>
<span class="s2"> v0.11.9 \</span>
<span class="s2"> v0.11.8 \</span>
<span class="s2"> v0.10.22 \</span>
<span class="s2"> v0.10.21 \</span>
<span class="s2"> v0.8.26 \</span>
<span class="s2">"</span>
<span class="nv">OUTPUT_PREFIX</span><span class="o">=</span><span class="s2">"leveldown-"</span>
<span class="nv">TEST_CMD</span><span class="o">=</span><span class="s2">"\</span>
<span class="s2"> cd /dnt/ && \</span>
<span class="s2"> npm install && \</span>
<span class="s2"> node_modules/.bin/node-gyp --nodedir /usr/src/node/ rebuild && \</span>
<span class="s2"> node_modules/.bin/tap test/*-test.js; \</span>
<span class="s2">"</span>
</pre></div>
<p>Another native Node add-on that I've set up with <strong>DNT</strong> is my <a href="https://github.com/rvagg/node-libssh">libssh bindings</a>. This one is a little more complicated because you need to have some non-standard libraries installed before compiling. My <em>.dntrc</em> adds some extra <code>apt-get</code> sauce to fetch and install those packages. It means the tests take a little longer but it's not prohibitive. An alternative would be to configure the <em>node_dev</em> base-image to include these packages so that all of my versioned images have them too.</p>
<p><strong>The node-libssh <a href="https://raw.github.com/rvagg/node-libssh/master/.dntrc">.dntrc</a></strong></p>
<div class="highlight"><pre><span class="nv">NODE_VERSIONS</span><span class="o">=</span><span class="s2">"master v0.11.9 v0.10.22"</span>
<span class="nv">OUTPUT_PREFIX</span><span class="o">=</span><span class="s2">"libssh-"</span>
<span class="nv">TEST_CMD</span><span class="o">=</span><span class="s2">"\</span>
<span class="s2"> apt-get install -y libkrb5-dev libssl-dev && \</span>
<span class="s2"> cd /dnt/ && \</span>
<span class="s2"> npm install && \</span>
<span class="s2"> node_modules/.bin/node-gyp --nodedir /usr/src/node/ rebuild --debug && \</span>
<span class="s2"> node_modules/.bin/tap test/*-test.js --stderr; \</span>
<span class="s2">"</span>
</pre></div>
<p><a href="https://github.com/rvagg/node-levelup">LevelUP</a> isn't a native add-on but it does use LevelDOWN which requires compiling. For the DNT config I'm removing <em>node_modules/leveldown/</em> prior to <code>npm install</code> so it gets rebuilt each time for each new version of Node.</p>
<p><strong>The <a href="https://raw.github.com/rvagg/node-levelup/master/.dntrc">LevelUP .dntrc</a></strong></p>
<div class="highlight"><pre><span class="nv">NODE_VERSIONS</span><span class="o">=</span><span class="s2">"\</span>
<span class="s2"> master \</span>
<span class="s2"> v0.11.9 \</span>
<span class="s2"> v0.11.8 \</span>
<span class="s2"> v0.10.22 \</span>
<span class="s2"> v0.10.21 \</span>
<span class="s2"> v0.8.26 \</span>
<span class="s2">"</span>
<span class="nv">OUTPUT_PREFIX</span><span class="o">=</span><span class="s2">"levelup-"</span>
<span class="nv">TEST_CMD</span><span class="o">=</span><span class="s2">"\</span>
<span class="s2"> cd /dnt/ && \</span>
<span class="s2"> rm -rf node_modules/leveldown/ && \</span>
<span class="s2"> npm install --nodedir=/usr/src/node && \</span>
<span class="s2"> node_modules/.bin/tap test/*-test.js --stderr; \</span>
<span class="s2">"</span>
</pre></div>
<h3 id="what-s-next-">What's next?</h3>
<p>I have no idea but I'd love to have helpers flesh this out a little more. It's not hard to imagine this forming the basis of a local CI system as well as a general testing tool. The speed even makes it tempting to run the tests on every git commit, or perhaps on every save.</p>
<p>If you'd like to contribute to development then please submit a pull request; I'd be happy to discuss anything you might think would improve this project. I'm keen to share ownership with anyone making significant contributions, as I do with most of my open source projects.</p>
<p>See the <strong><a href="https://github.com/rvagg/dnt">DNT</a></strong> GitHub repo for installation and detailed usage instructions.</p>
LevelDOWN v0.10 / managing GC in native V8 programming2013-11-18T00:00:00.000Zhttps://r.va.gg/2013/11/leveldown-v0.10-managing-gc-in-native-v8-programming.html
<p><img src="https://twimg0-a.akamaihd.net/profile_images/3360574989/92fc472928b444980408147e5e5db2fa_bigger.png" alt="LevelDB"></p>
<p>Today we released version 0.10 of <a href="https://github.com/rvagg/node-leveldown">LevelDOWN</a>. LevelDOWN is the package that directly binds LevelDB into Node-land. It's mainly C++ and is a fairly raw & direct interface to LevelDB. <a href="https://github.com/rvagg/node-levelup">LevelUP</a> is the package that we recommend most people use for LevelDB in Node as it takes LevelDOWN and makes it much more Node-friendly, including the addition of those lovely <em>ReadStreams</em>.</p>
<p>Normally I wouldn't write a post about a minor release like this but this one seems significant because of a number of small changes that culminate in a <em>relatively</em> major release.</p>
<p><strong><em>In this post:</em></strong></p>
<ul>
<li><strong>V8 <code>Persistent</code> references</strong></li>
<li><strong><code>Persistent</code> in LevelDOWN; some removed, some added</strong></li>
<li><strong>Leaks!</strong></li>
<li><strong>Snappy 1.1.1</strong></li>
<li><strong>Some embarrassing bugs</strong></li>
<li><strong>Domains</strong></li>
<li><strong>Summary</strong></li>
<li><strong><em>A final note on Node 0.11.9</em></strong></li>
</ul>
<h3 id="v8-persistent-references">V8 <code>Persistent</code> references</h3>
<p>The main story of this release is <code>v8::Persistent</code> references. For the uninitiated, V8 internally has two different ways to track "handles", which are references to JavaScript objects and values currently active in a running program: <code>Local</code> references and <code>Persistent</code> references. <code>Local</code> references are the most common; they are the references you get when you create an object, pass it around within a function, and do the normal work you do with an object. <code>Persistent</code> references are a special case that is all about <em>Garbage Collection</em>. An object that has at least one active <code>Persistent</code> reference to it is not a candidate for garbage collection; <code>Persistent</code> references must be explicitly destroyed before they release the object and make it available to the garbage collector.</p>
<p>Prior to V8 3.2x.xx <em>(I don't know the exact version, does it matter? It roughly corresponds to Node v0.11.3.)</em>, both kinds of handle were equally easy to create and interchange; you could swap one for the other whenever you needed to. My guess is that the V8 team decided that this was a little <em>too</em> easy and that a major cause of memory leaks in C++ V8 code was the ease with which you could swap a <code>Local</code> for a <code>Persistent</code> and then forget to destroy the <code>Persistent</code>. So they tweaked the "ease" equation and it's become quite difficult.</p>
<p><code>Persistent</code> and <code>Local</code> no longer share the same type hierarchy and the way you instantiate and assign a <code>Persistent</code> has become quite awkward. You now have to go through enough gymnastics to create a <code>Persistent</code> that it makes you ask the question: <em>"Do I really need this to be a <code>Persistent</code>?"</em> Which I guess is a good thing for memory leaks. <a href="https://github.com/rvagg/nan">NAN</a> to the rescue though! We've somewhat papered over those difficulties with the capabilities introduced in NAN; it's still not as easy as it once was, but it's not a total headache.</p>
<p>So, you understand <code>v8::Persistent</code> now? Great, so back to LevelDOWN.</p>
<h3 id="-persistent-in-leveldown-some-removed-some-added-"><code>Persistent</code> in LevelDOWN; some removed, some added!</h3>
<p><strong>Some removed</strong></p>
<p>Recently, <a href="https://github.com/mcollina">Matteo</a> noticed that when you're performing a <code>Batch()</code> operation in LevelDB, there is an explicit copy of the data that you're feeding into that batch. When you construct a Batch operation in LevelDB you start off with a short string representing the batch and then build on that string as you build your batch with both <code>Put()</code> and <code>Del()</code> operations. You end up with a long string containing all of your write data: keys and values. Then when you call <code>Write()</code> on the Batch, that string gets fed directly into the main LevelDB store as a single write—which is where the atomicity of Batch comes from.</p>
<p>Both the chained-form and array-form <code>batch()</code> operations work this way internally in LevelDOWN.</p>
<p>However, with almost all operations in LevelDOWN, we perform the actual writes and reads against LevelDB in libuv worker threads. So we have to create the "descriptor" for work in the main V8 Node thread and then hand that off to libuv to perform the work in a separate thread. Once the work is completed we get the results back in the main V8 Node thread from where we can trigger a callback. This is where <code>Persistent</code> references come in.</p>
<p>Before we hand off the work to libuv, we need to make <code>Persistent</code> references to any V8 object that we want to survive across the asynchronous operation. Obviously the main candidate for this is <code>callback</code> functions. Consider this code:</p>
<div class="highlight"><pre><span class="nx">db</span><span class="p">.</span><span class="nx">get</span><span class="p">(</span><span class="s1">'foo'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">err</span><span class="p">,</span> <span class="nx">value</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s1">'foo = %s'</span><span class="p">,</span> <span class="nx">value</span><span class="p">)</span>
<span class="p">})</span>
</pre></div>
<p>What we've actually done is create an anonymous closure for our callback. It has nothing referencing it, so as far as V8 is concerned it's a candidate for garbage collection once the current thread of execution is completed. In Node, however, we're doing asynchronous work with it and need it to survive until we actually call it. So we receive the <code>callback</code> function as a <code>Local</code> in our C++ but then assign it to a <code>Persistent</code> so GC doesn't touch it. Once we're done with our async work we can call the function and destroy the <code>Persistent</code>, effectively turning it back into a <code>Local</code> and freeing it up for GC.</p>
<p>Without the <code>Persistent</code>, the behaviour is indeterminate. It depends on the version of V8, the GC settings, the workload currently in the program and the amount of time the async work takes to complete. If the GC is aggressive enough and has a chance to run before our async work is complete, the <code>callback</code> will disappear and we'll end up trying to call a function that no longer exists. This can obviously lead to runtime errors and will most likely crash our program.</p>
<p>In LevelDOWN, if you're passing in <code>String</code> objects for keys and values, then to pull out the data and turn it into a form that LevelDB can use we have to do an explicit <em>copy</em>. Once we've copied the data from the <code>String</code>, we don't need to care about the original object and GC can get its hands on it as soon as it wants. So we can leave <code>String</code> objects as <code>Local</code> references while we are building the descriptor for our async work.</p>
<p><code>Buffer</code> objects are a different matter altogether. Because we have access to the raw character array of a <code>Buffer</code>, we can feed that data straight into LevelDB and this saves us one <em>copy</em> operation (which can be a significant performance boost if the data is large or you're doing lots of operations—so prefer <code>Buffer</code>s where convenient if you need higher perf). When building the descriptor for the async work, we are just passing a character array to the LevelDB data structures that we're setting up. Because the data is shared with the original <code>Buffer</code> we have to make sure that GC doesn't clean up that <code>Buffer</code> before we have a chance to use the data. So we make a <code>Persistent</code> reference for it which we clean up after the async work is complete. So you can do this without worrying about GC:</p>
<div class="highlight"><pre><span class="nx">db</span><span class="p">.</span><span class="nx">put</span><span class="p">(</span>
<span class="k">new</span> <span class="nx">Buffer</span><span class="p">(</span><span class="s1">'foo'</span><span class="p">)</span>
<span class="p">,</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'crypto'</span><span class="p">).</span><span class="nx">randomBytes</span><span class="p">(</span><span class="mi">1024</span><span class="p">)</span>
<span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">err</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s1">'foo is now some random data!'</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">)</span>
</pre></div>
<p>This has been the case in LevelDOWN for all operations since pretty much the beginning. But back to Matteo's observation: if LevelDB's data structures perform an explicit copy on the data we feed it, then perhaps we don't need to keep the original data safe from GC? For a <code>batch()</code> call it turns out that we don't! When we're constructing the Batch descriptor, as we feed data into it with both <code>Put()</code> and <code>Del()</code>, it takes a copy of our data to create its internal representation. So even when we're using <code>Buffer</code> objects on the JavaScript side, we're done with them before the call down into LevelDOWN is completed, so there's no reason to hold a <code>Persistent</code> reference! For other operations we're still doing some copying during the asynchronous cycle, but the removal of the overhead of creating and deleting <code>Persistent</code> references for <code>batch()</code> calls is fantastic news for those doing bulk data loading (like Max Ogden's <a href="https://github.com/maxogden/dat">dat</a> project which needs to bulk load a <em>lot</em> of data).</p>
<p><strong>Some added</strong></p>
<p>Another gem from Matteo was a report of crashes during certain <code>batch()</code> operations. Difficult to reproduce and occurring only under very particular circumstances, the crash was mostly triggered by the kinds of workloads LevelGraph generates. Thanks to some simple C++ debugging we traced it to a dropped reference, obviously by GC. The code in question boiled down to something like this:</p>
<div class="highlight"><pre><span class="kd">function</span> <span class="nx">doStuff</span> <span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">batch</span> <span class="o">=</span> <span class="nx">db</span><span class="p">.</span><span class="nx">batch</span><span class="p">()</span>
<span class="nx">batch</span><span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'bar'</span><span class="p">)</span>
<span class="nx">batch</span><span class="p">.</span><span class="nx">write</span><span class="p">(</span><span class="kd">function</span> <span class="p">(</span><span class="nx">err</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s1">'done'</span><span class="p">,</span> <span class="nx">err</span><span class="p">)</span>
<span class="p">})</span>
<span class="p">}</span>
</pre></div>
<p>In this code, the <code>batch</code> object is actually a LevelDOWN <code>Batch</code> object created in C++-land. During the <code>write()</code> operation, which is asynchronous, we end up with no hard references to <code>batch</code> in our code because the JS thread has yielded and moved on, and the <code>batch</code> is contained within the scope of the <code>doStuff()</code> function. Because most of the asynchronous operations we perform are relatively quick, this normally doesn't matter. But for writes to LevelDB, if you have enough data in your write and you have enough data already in your data store, you can trigger a compaction upstream, which can delay the write and give V8's GC time to clean up references that might be important and for which you have no <code>Persistent</code> handles.</p>
<p>In this case, we weren't actually creating internal <code>Persistent</code> references for some of our objects: <code>Batch</code> here, but also <code>Iterator</code>. Normally this isn't a problem because to use these objects you <em>generally</em> keep references to them yourself in your own code.</p>
<p>We managed to debug Matteo's crash by adjusting his test code to look something like this and watching it succeed without a crash:</p>
<div class="highlight"><pre><span class="kd">function</span> <span class="nx">doStuff</span> <span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">batch</span> <span class="o">=</span> <span class="nx">db</span><span class="p">.</span><span class="nx">batch</span><span class="p">()</span>
<span class="nx">batch</span><span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'bar'</span><span class="p">)</span>
<span class="nx">batch</span><span class="p">.</span><span class="nx">write</span><span class="p">(</span><span class="kd">function</span> <span class="p">(</span><span class="nx">err</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s1">'done'</span><span class="p">,</span> <span class="nx">err</span><span class="p">)</span>
<span class="nx">batch</span><span class="p">.</span><span class="nx">foo</span> <span class="o">=</span> <span class="s1">'bar'</span>
<span class="p">})</span>
<span class="p">}</span>
</pre></div>
<p>By reusing <code>batch</code> inside our <code>callback</code> function, we're creating some work that V8 can't optimise away and therefore has to assume isn't a noop. Because the <code>batch</code> variable is now also referenced by the <code>callback</code> function, and we already have an internal <code>Persistent</code> for the <code>callback</code>, GC has to pass over <code>batch</code> until that <code>Persistent</code> is destroyed.</p>
<p>So the solution is simply to create a <code>Persistent</code> for the internal objects that need to survive across asynchronous operations and make no assumptions about how they'll be used in JavaScript-land. In our case we've gone for assigning a <code>Persistent</code> just prior to every asynchronous operation and destroying it after. The alternative would be to have a <code>Persistent</code> assigned upon the creation of objects we care about but sometimes we want GC to do its work:</p>
<div class="highlight"><pre><span class="kd">function</span> <span class="nx">dontDoStuff</span> <span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">batch</span> <span class="o">=</span> <span class="nx">db</span><span class="p">.</span><span class="nx">batch</span><span class="p">()</span>
<span class="nx">batch</span><span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'bar'</span><span class="p">)</span>
<span class="c1">// nothing else, wut?</span>
<span class="p">}</span>
</pre></div>
<p>I don't know why you would write that code but perhaps you have a use-case where you want the ability to start constructing a batch but then decide not to follow through with it. GC should be able to take care of your mess like it does with all of the other messes you create in your daily adventures with JavaScript.</p>
<p>So we are only assigning a <code>Persistent</code> when you do a <code>write()</code> with a chained-batch operation in LevelDOWN since it's the only asynchronous operation. So in <code>dontDoStuff()</code> GC will come along and rid us of <code>batch</code>, <code>'foo'</code> and <code>'bar'</code> when it has the next opportunity and our C++ code will have the appropriate destructors called that will clean up any other objects we have created along the way, like the internal LevelDB <code>Batch</code> with its copy of our data.</p>
<h3 id="leaks-">Leaks!</h3>
<p>We've been having some trouble with leaks in LevelUP/LevelDOWN lately <em>(<a href="https://github.com/rvagg/node-levelup/issues/171">LevelDOWN/#171</a>, <a href="https://github.com/mcollina/levelgraph/issues/40">LevelGraph/#40</a>)</em>. And it turns out that these leaks aren't related to <code>Persistent</code> references, which shouldn't be a surprise since it's so easy to leak with non-GC code, particularly if you spend most of your day programming in a language with GC.</p>
<p>With the help of <a href="http://valgrind.org/">Valgrind</a> we tracked the leak down to the omission of a <code>delete</code> in the destructor of the asynchronous work descriptor for array-batch operations. The internal LevelDB representation of a Batch wasn't being cleaned up unless you were using the chained-form of LevelDOWN's <code>batch()</code>. This one has been dogging us for a few releases now and it's been a headache particularly for people doing bulk-loading of data so I hope we can finally put it behind us!</p>
<h3 id="snappy-1-1-1">Snappy 1.1.1</h3>
<p>Google released a new version of Snappy, version 1.1.1. I don't really understand how Google uses <a href="http://semver.org/">semver</a>; we get very simple LevelDB releases with the minor version bumped and then we get versions of Snappy released with non-trivial changes with only the patch version bumped. I suspect that Google doesn't know how it uses semver either and there's no internal policy on it.</p>
<p>Anyway, Snappy 1.1.1 has some fixes, some minor speed and compression improvements but most importantly it breaks compilation on Windows. So we had to figure out how to fix that for this release. Ugh. I also took the opportunity to clean up some of the compilation options for Snappy and we may see some improvements in the way it works now... perhaps.</p>
<h3 id="some-embarrassing-bugs">Some embarrassing bugs</h3>
<p><a href="https://github.com/kytwb">Amine Mouafik</a> is new to the LevelDOWN repository but has picked up some rather embarrassing bugs/omissions that are probably my fault. It's great to have more eyes on the C++ code; there aren't enough JavaScript programmers with the confidence to dig into messy C++-land.</p>
<p>Firstly, on our standard LevelDOWN releases, it turns out that we haven't actually been enabling the internal <strong>bloom filter</strong>. The bloom filter was introduced in LevelDB to speed up read operations by avoiding having to scan through whole blocks to find the data a read is looking for. So that's now enabled for 0.10.</p>
<p>Then he discovered that we had been <strong>turning off compression</strong> by default! I believe this happened with the switch to NAN, at roughly Node version 0.11.4. The signature for reading boolean options from V8 objects changed from the internal <code>LD_BOOLEAN_OPTION_VALUE</code> and <code>LD_BOOLEAN_OPTION_VALUE_DEFTRUE</code> macros (defaulting to <code>false</code> and <code>true</code> respectively when the option isn't supplied) to NAN's unified <code>NanBooleanOptionValue</code>, which takes an optional <code>defaultValue</code> argument that can be used to make the default <code>true</code>.</p>
<p>Well, this code:</p>
<div class="highlight"><pre><span class="kt">bool</span> <span class="n">compression</span> <span class="o">=</span>
<span class="n">NanBooleanOptionValue</span><span class="p">(</span><span class="n">optionsObj</span><span class="p">,</span> <span class="n">NanSymbol</span><span class="p">(</span><span class="s">"compression"</span><span class="p">));</span>
</pre></div>
<p>is now this:</p>
<div class="highlight"><pre><span class="kt">bool</span> <span class="n">compression</span> <span class="o">=</span>
<span class="n">NanBooleanOptionValue</span><span class="p">(</span><span class="n">optionsObj</span><span class="p">,</span> <span class="n">NanSymbol</span><span class="p">(</span><span class="s">"compression"</span><span class="p">),</span> <span class="nb">true</span><span class="p">);</span>
</pre></div>
<p>so if you don't supply a <code>"compression"</code> boolean option in your db setup operation then it'll now actually be turned on!</p>
<h3 id="domains">Domains</h3>
<p>We've finally caught up with properly supporting Node's <a href="http://nodejs.org/docs/latest/api/domain.html">domains</a> by switching all C++ <code>callback</code> calls from standard V8 <code>callback->Call(...)</code> to Node's own <code>node::MakeCallback(callback, ...)</code> which does the same thing but also does lots of additional things, including accounting for domains. This change was also included in NAN version 0.5.0.</p>
<h3 id="summary">Summary</h3>
<p><strong>Go and upgrade!</strong></p>
<p>leveldown@0.10.0 is packaged with the new levelup@0.18.0 and level@0.18.0 which have their minor versions bumped purely for this LevelDOWN release.</p>
<p>Also released are the packages:</p>
<ul>
<li>leveldown-hyper@0.10.0</li>
<li>leveldown-basho@0.10.0</li>
<li>rocksdb@0.10.0 (based on the same LevelDOWN code) (Linux only)</li>
<li>level-hyper@0.18.0 (levelup on leveldown-hyper)</li>
<li>level-basho@0.18.0 (levelup on leveldown-basho)</li>
<li>level-rocks@0.18.0 (levelup on rocksdb) (Linux only)</li>
</ul>
<p>I'll write more about these packages in the future since they've gone largely under the radar for most people. If you're interested in catching up then please join <strong>##leveldb</strong> on Freenode where there's a bunch of Node database people and also a few non-Node LevelDB people like <a href="https://twitter.com/rescrv">Robert Escriva</a>, author of HyperLevelDB and all-round LevelDB expert.</p>
<h3 id="-a-final-note-on-node-0-11-9-"><em>A final note on Node 0.11.9</em></h3>
<p>There will be a LevelDOWN@0.10.1 very soon that will increment the NAN dependency to 0.6.0 when it's released. This new version of NAN will specifically deal with Node 0.11.9 compatibility, where there are more breaking V8 changes that will cause compile errors for any addon not taking them into account. So if you're living on the edge in Node then we should have a release soon enough for you!</p>
All the levels!2013-10-09T00:00:00.000Zhttps://r.va.gg/2013/10/all-the-levels.html
<p>When we completely separated <a href="https://github.com/rvagg/node-levelup">LevelUP</a> and <a href="https://github.com/rvagg/node-leveldown">LevelDOWN</a> so that installing LevelUP didn't automatically get you LevelDOWN, we set up a new package called <strong><a href="https://github.com/Level/level">Level</a></strong> that has them both as a dependency so you just need to do <code>var level = require('level')</code> and everything is done for you.</p>
<p>But, we now have more than just the vanilla (Google) LevelDB in LevelDOWN. We also have a HyperLevelDB version and a Basho fork. These are maintained on branches in the LevelDOWN repo and are now usually released every time a new LevelDOWN is released. They are called <strong>leveldown-hyper</strong> and <strong>leveldown-basho</strong> in npm but you need to plug them into LevelUP yourself to make them work. We also have <a href="https://github.com/rvagg/lmdb">Node LMDB</a>, which is LevelDOWN-compatible, and a few others.</p>
<p>So, as of today, we've released a new, small library called <strong><a href="https://github.com/level/level-packager">level-packager</a></strong> that does this bundling process so that you can feed it a LevelDOWN instance and it'll return a Level-type object that can be exported from a package like <strong>Level</strong>. This is meant to be used internally and it's now being used to support these new packages that are available in npm:</p>
<ul>
<li><strong><a href="https://github.com/Level/level-hyper">level-hyper</a></strong> bundles the HyperLevelDB version of LevelDOWN with LevelUP</li>
<li><strong><a href="https://github.com/Level/level-basho">level-basho</a></strong> bundles the Basho fork of LevelDB in LevelDOWN with LevelUP</li>
<li><strong><a href="https://github.com/Level/level-lmdb">level-lmdb</a></strong> bundles Node LMDB with LevelUP</li>
</ul>
<p>The version numbers of these packages will track the version of LevelUP.</p>
<p>So you can now simply do:</p>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">level</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'level-hyper'</span><span class="p">)</span>
<span class="kd">var</span> <span class="nx">db</span> <span class="o">=</span> <span class="nx">level</span><span class="p">(</span><span class="s1">'/path/to/db'</span><span class="p">)</span>
<span class="nx">db</span><span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'woohoo!'</span><span class="p">)</span>
</pre></div>
<p>If you're already using <strong>Level</strong> then you can very easily switch it out with one of these alternatives to try them out.</p>
<p>Both HyperLevelDB and the Basho LevelDB fork are binary-compatible with Google's LevelDB, with one small caveat: with the latest release, LevelDB has switched to making <em>.ldb</em> files instead of <em>.sst</em> files inside a data store directory because of something about Windows backups (blah blah). Neither of the alternative forks knows anything about these new files yet so you may run into trouble if you have <em>.ldb</em> files in your store (although I'm pretty sure you can simply rename these to <em>.sst</em> and it'll be fine with any version).</p>
<p>Also, LMDB is completely different to LevelDB so you won't be able to open an existing data store. But you should be able to do something like this:</p>
<div class="highlight"><pre><span class="nx">require</span><span class="p">(</span><span class="s1">'level'</span><span class="p">)(</span><span class="s1">'/path/to/level.db'</span><span class="p">).</span><span class="nx">createReadStream</span><span class="p">()</span>
<span class="p">.</span><span class="nx">pipe</span><span class="p">(</span><span class="nx">require</span><span class="p">(</span><span class="s1">'level-lmdb'</span><span class="p">)(</span><span class="s1">'/path/to/lmdb.db'</span><span class="p">).</span><span class="nx">createWriteStream</span><span class="p">())</span>
</pre></div>
<p>Whoa...</p>
<h3 id="a-note-about-hyperleveldb">A note about HyperLevelDB</h3>
<p>Lastly, I'd like to encourage you to try the HyperLevelDB version if you are pushing hard on LevelDB's performance. The HyperDex fork is tuned for multi-threaded access for reads and writes and is therefore particularly suited to how we use it in Node. The Basho version doesn't show much performance difference mainly because they are optimising for Riak running 16 separate instances on the same server so multi-threaded access isn't as interesting for them. You should find significant performance gains if you're doing very heavy writes in particular with HyperLevelDB. Also, if you're interested in support for HyperLevelDB then pop in to ##leveldb on Freenode and bother <em><a href="https://twitter.com/rescrv">rescrv</a></em> (Robert Escriva), author of HyperLevelDB and our resident LevelDB expert.</p>
<p>It's also worth noting that HyperDex are interested in offering commercial support for people using LevelDB, not just HyperLevelDB but also Google's LevelDB. This means that anyone using either of these packages in Node should be able to get solid support if they are doing any heavy work in a commercial environment and need the surety of experts behind them to help pick up the pieces. I imagine this would cover things like LevelDB corruption and any LevelDB bugs you may run into (we're currently looking at a subtle <a href="https://github.com/rvagg/node-levelup/issues/171">batch-related LevelDB bug</a> that's come along with the 1.14.0 release, they do exist!). Talk to Robert if you want more information about commercial support.</p>
Should I use a single LevelDB or many to hold my data?2013-10-03T00:00:00.000Zhttps://r.va.gg/2013/10/should-i-use-a-single-leveldb-or-many-to-hold-my-data.html
<p>This is a long overdue post, so long in fact that I can't remember who I promised to do this for! Regardless, I keep on having discussions around this topic so I thought it worthwhile putting down some notes on what I believe to be the factors you should consider when making this decision.</p>
<h3 id="what-s-the-question-">What's the question?</h3>
<p>It goes like this: you have an application that uses LevelDB. In particular I'm talking about Node.js applications here, but the same would apply if you're using LevelUP in the browser, and also to most of the other back-ends for LevelUP. You invariably end up with different kinds of data; sometimes the kinds of data you're storing are so different that it feels strange putting them into the same storage blob. Often though, you just have sets of not-very-related data that you need to store and you end up having to make a decision: <strong>do I put everything into a single LevelDB store or do I put things into their own, separate, LevelDB stores?</strong></p>
<h3 id="this-stuff-doesn-t-belong-together-">This stuff doesn't <em>belong</em> together!</h3>
<p>Coming from a relational database background, it took me a little while to displace the concept of discrete <em>tables</em> with the notion of <em>namespacing</em> within the same store. I can understand the temptation to keep things separate, not wanting to end up with a huge blob of data that just <em>shouldn't be together</em>. But this isn't the relational database world and you need to move on!</p>
<p>We have a set of LevelUP addons, such as <a href="https://github.com/dominictarr/level-sublevel">sublevel</a>, that exist mainly to provide you with the comfort of being able to separate your data by whatever criteria makes sense. <a href="https://github.com/deanlandolt/bytewise">bytewise</a> is another tool that can serve a similar purpose and some people even use sublevel and bytewise together to achieve more complex organisation.</p>
<p><strong>We have the tools at our disposal in Node.js to turn a one-dimensional storage array into a very complex, multidimensional storage <em>system</em> where unrelated, and semi-related data can coexist.</strong> So, if the only reason you want to store things in separate stores is because it just <em>feels</em> right to do so, you should probably be looking at what's making you think that way. You may need to update your assumptions.</p>
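<p>To make that concrete, here's a dependency-free sketch of roughly the idea behind <strong>sublevel</strong>: namespaces become key prefixes within the one store. The separator, method set and stub <code>db</code> here are illustrative only, not sublevel's actual implementation:</p>

```javascript
// Illustrative sketch only: real sublevel also handles separators, nesting,
// ReadStreams and batches; this just shows the key-prefixing idea.
function sublevel(db, ns) {
  var prefix = '!' + ns + '!'
  return {
    put: function (key, value, cb) { db.put(prefix + key, value, cb) },
    get: function (key, cb) { db.get(prefix + key, cb) }
  }
}

// A stub "db" so the sketch runs without LevelDB itself
var store = {}
var db = {
  put: function (k, v, cb) { store[k] = v; if (cb) cb(null) },
  get: function (k, cb) { cb(null, store[k]) }
}

var users = sublevel(db, 'users')
var posts = sublevel(db, 'posts')
users.put('rod', 'Rod Vagg')
posts.put('rod', 'All the levels!')
// The two 'rod' keys don't collide: they live at '!users!rod' and '!posts!rod'
```

<p>Everything still lives in one flat keyspace, which is the point: the "separation" is an organisational layer on top of a single store, not a second store.</p>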
<h3 id="technical-considerations">Technical considerations</h3>
<p>That aside, there are some technical considerations for making this decision:</p>
<h4 id="size-and-performance">Size and performance</h4>
<p>To be clear, <strong>LevelDB is fast</strong> and it can also store <strong>lots of data</strong>; it'll handle Gigabytes of data without too much sweat. However, there <em>are</em> some performance concerns when you start getting into the Gigabyte range, mainly when you're trying to push data in at a high rate. Most use-cases don't do this so be honest about your performance needs. For most people LevelDB is simply fast.</p>
<p>However, if you do have a high-throughput scenario involving a large amount of data that you need to store then you may want to consider having a separate store to deal with the large data and another one to deal with the rest of your data so the performance isn't impacted across the board.</p>
<p>But again, be honest about what your workload is, you're probably not pushing <a href="http://voxer.com">Voxer</a> amounts of data so don't prematurely optimise around the workload you'd like to think you have or are going to have one day in the distant future.</p>
<h4 id="cache">Cache</h4>
<p>Caching is transparent by default with LevelDB so it's easy to forget about it when making these kinds of decisions but it's actually quite important for this particular question.</p>
<p>By default, you have an 8M LRU cache with LevelDB and <em>all</em> reads use that cache, for look-ups and also for updating with newly read values. So, you can have a lot of cache-thrash unless you're reading the same values again and again. </p>
<p>But, there is a <code>fillCache</code> (boolean) option for read operations (both <code>get()</code> and <code>createReadStream()</code>, including its variations). So you can set this to <code>false</code> where you know you won't be needing fast access to those entries again and you don't want to push out other entries from the LRU.</p>
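<p>To make the thrash concrete, here's a toy LRU sketch in plain JavaScript. It's purely illustrative; LevelDB's real cache is native code and more sophisticated, but the eviction behaviour it demonstrates is the one described above:</p>

```javascript
// Toy LRU: a normal read inserts the value, evicting the least-recently-used
// entry when over capacity; a fillCache=false style read would simply skip
// the insertion step entirely, leaving the cache untouched.
function makeLRU (capacity) {
  const map = new Map() // Map iteration order doubles as recency order

  return {
    get (key) {
      if (!map.has(key)) return undefined
      const value = map.get(key)
      map.delete(key)
      map.set(key, value) // refresh recency
      return value
    },
    put (key, value) {
      if (map.has(key)) map.delete(key)
      map.set(key, value)
      if (map.size > capacity) {
        map.delete(map.keys().next().value) // evict least-recently-used
      }
    },
    has (key) {
      return map.has(key)
    }
  }
}

const cache = makeLRU(2)
cache.put('hot', 'value') // an entry we'd like to keep cached

// a bulk scan that fills the cache pushes it out...
cache.put('scan:1', 'a')
cache.put('scan:2', 'b')
console.log(cache.has('hot')) // false: evicted by the scan
// ...whereas a scan reading with fillCache:false would have left it alone
```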
<p>So caching strategies can be separate for different types of data and are not a strong reason to keep things in a separate data store.</p>
<p>I always recommend tinkering with the <code>cacheSize</code> option when you're using LevelDB; it can be as large as you can afford within the available memory of your machine. As a rule of thumb, somewhere between 2/3 and 3/4 of the available memory should be a maximum.</p>
<p>Consider, though, what happens if you're using separate LevelDB stores: you now have to juggle <code>cacheSize</code> between the stores. Often, you're going to be best served by having a single, large cache that can operate across all your data types, letting the normal behaviour of your application determine what gets cached, with occasional reliance on <code>fillCache: false</code> to fine-tune. </p>
<h4 id="consistency">Consistency</h4>
<p>As I discussed in my <a href="https://r.va.gg/presentations/lxjs2013/">LXJS</a> talk, the <em>atomic batch</em> is an important primitive for building solid database functionality with inherent <em>consistency</em>. When you're using <strong>sublevel</strong>, even though each sublevel operates like a separate LevelUP instance, you still get to perform atomic batch operations across sublevels. Consider indexing, where you may have a primary sublevel for the entries you're writing and a secondary sublevel for the indexing data used to reference the primary data for lookups. If you're running these as separate stores then you lose the benefits of the atomic batch: you just can't perform multiple operations with guaranteed consistency.</p>
<p>Try to keep the atomic batch in mind when building your application: instead of accepting the possibility of inconsistent state, use the batch to maintain consistency.</p>
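<p>The indexing pattern can be sketched as a single batch. Key prefixes stand in for sublevels here so the example is self-contained, and the <code>entryBatch()</code> helper and key layout are made up for illustration:</p>

```javascript
// Build the primary entry and its secondary index entry as one batch so
// they can be written atomically: both succeed or neither does, and the
// index can never point at a primary entry that was never written.
function entryBatch (id, person) {
  return [
    // primary entry
    { type: 'put', key: 'data:' + id, value: JSON.stringify(person) },
    // secondary index: find ids by name
    { type: 'put', key: 'index:name:' + person.name + ':' + id, value: id }
  ]
}

const ops = entryBatch('1234', { name: 'rvagg' })
// to be applied with: db.batch(ops, callback)
console.log(ops.map(function (op) { return op.key }))
// [ 'data:1234', 'index:name:rvagg:1234' ]
```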
<h4 id="back-end-flexibility">Back-end flexibility</h4>
<p>OK, this one is a bit left-field, but remember that LevelUP is back-end-agnostic. It's inspired by LevelDB but it doesn't have to be Google's LevelDB that's storing data for you. It could be Basho's fork or HyperLevelDB. It could even be LMDB or something a little crazy like MemDOWN or mysqlDOWN! </p>
<p>If you're at all concerned about performance (and most people claim to be, even though they're not building performance-critical applications), then you should be benchmarking your particular workload against your storage system. Each of the back-ends for LevelUP has different performance characteristics and different trade-offs that you need to understand and test against your needs. You may find that one back-end works for one kind of data in your application and another back-end works for another.</p>
<h3 id="summary">Summary</h3>
<p>The TL;DR is: in most cases, a single LevelDB store is generally preferable unless you have a <em>real</em> reason for having separate ones.</p>
<p>Have I missed any considerations that you've come across when making this choice? Let me know in the comments.</p>
<h2>Primitives for JS Databases (an LXJS adventure)</h2>
<p><em>2013-10-03 · <a href="https://r.va.gg/2013/10/primitives-for-js-databases-an-lxjs-adventure.html">https://r.va.gg/2013/10/primitives-for-js-databases-an-lxjs-adventure.html</a></em></p>
<p>I gave a talk yesterday at <strong><a href="http://2013.lxjs.org">LXJS</a></strong> in the <em>"Infrastructure.js"</em> block and tried to talk about JavaScript Database Primitives; i.e. the basic building blocks we have landed on for building more complex database solutions in JavaScript.</p>
<p>The talk certainly wasn't as good or clear as I wanted it to be, it worked much better in my head! A huge venue with over 300 talented JavaScripters, an absolutely massive screen, bright lights and loud amplification got the better of me and I wasn't able to pull the material together how I wanted to. The introvert within me is telling me to become a recluse for a little while just to recover! My <em>hope</em> is that at least one or two people are inspired to give <em>database hacking</em> a go because it's really not that difficult once you get your head around the primitives.</p>
<p><strong><em>Edit:</em></strong> <em>I wasn't trying to elicit sympathy here, I genuinely think that I wasn't clear on what I was trying to communicate. It went so well in my head, as it usually does, but I fell far short of what I wanted to express. I'll attempt to rectify some of that with a writeup (see next para).</em></p>
<p>Thankfully though, a portion of the material will be able to serve as the basis for the long-overdue third part in my <a href="http://dailyjs.com/2013/04/19/leveldb-and-node-1/">three</a> <a href="http://dailyjs.com/2013/05/03/leveldb-and-node-2/">part</a> <a href="http://dailyjs.com">DailyJS</a> series on LevelDB & Node.</p>
<p>In summary, inspired by LevelDB, we've ended up with a core set of primitives in <a href="https://github.com/rvagg/node-levelup">LevelUP</a> that can be used to build feature-rich and advanced database functionality. <strong>Atomic batch</strong> and <strong>ReadStream</strong> are the two non-trivial primitives; open, close, get, put and del are all pretty easy to understand as primitives, although <em>del</em> is perhaps redundant; we're opting for explicitness.</p>
<p>My <a href="https://r.va.gg/presentations/lxjs2013">slides are online</a> but hopefully I'll be able to get my DailyJS article sorted out soon and I'll be able to explain what I was trying to get at.</p>
<p>ReadStream as a primitive query mechanism is not too hard to understand once you get your head around key sorting and the implications for key structure. Batch is a little more subtle and relates to consistency and our ability to augment basic operations to create more complex functionality while keeping the data store in a consistent state.</p>
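<p>A plain-JavaScript sketch of the key-sorting idea, with a made-up <code>log:</code> key structure; this is conceptually what a <code>createReadStream()</code> range query does for you:</p>

```javascript
// Because keys are kept in sorted order, a structured key gives you range
// queries almost for free: picking a day's entries is just a lexicographic
// range over the sorted key space.
const keys = [
  'log:2013-10-02:002',
  'log:2013-10-01:001',
  'log:2013-10-03:001',
  'log:2013-10-02:001'
].sort() // the store keeps its keys sorted for us

function range (sortedKeys, start, end) {
  return sortedKeys.filter(function (k) { return k >= start && k <= end })
}

// everything logged on 2013-10-02, however many entries exist
console.log(range(keys, 'log:2013-10-02:', 'log:2013-10-02:\xff'))
// [ 'log:2013-10-02:001', 'log:2013-10-02:002' ]
```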
<p>I additionally raised "Buckets", or "Namespaces", as a primitive concept and discussed how <a href="https://github.com/dominictarr/level-sublevel">sublevel</a> has effectively become the standard for turning a one-dimensional data store into a multi-dimensional store able to encapsulate sophisticated functionality behind what is essentially just a key/value store.</p>
<h3 id="thanks-to-the-lxjs-team">Thanks to the LXJS team</h3>
<p>It would be neglectful of me to not say how absolutely grateful I am to the LXJS team for putting so much effort into taking care of speakers; fantastic job.</p>
<p>LXJS is an amazing event, put on by a dedicated and very talented team of people committed to the JavaScript community and the JavaScript community in Portugal in particular. This conference sets a very high bar for community-driven conferences with the way it has managed to get so many locals (and internationals!) involved in running an event in their own time.</p>
<p><strong>David Dias, Ana Hevesi, Pedro Teixeira, Luís Reis, Nuno Job, Tiago Rodrigues, Leo Xavier, Alexander Kustov, André Rodrigues and Bruno Coelho</strong> have managed to put on an amazing event and are some of the nicest and most talented people I've met. Thank you to you all and everyone else who put on LXJS 2013; your hard work is appreciated and should be an inspiration to everyone involved in our local JavaScript communities, running events or considering running events like this.</p>
<h2>NodeConf.eu</h2>
<p><em>2013-09-27 · <a href="https://r.va.gg/2013/09/nodeconf.eu.html">https://r.va.gg/2013/09/nodeconf.eu.html</a></em></p>
<p>Wow, <strong><a href="http://nodeconf.eu/">NodeConf.eu</a></strong> was certainly a once-in-a-lifetime event ... although there's talk of a repeat performance next year (don't miss the chance when it comes around!).</p>
<div style="text-align: center;">
<img src="https://r.va.gg/images/2013/09/nodeconfeu_raiseflag.jpg" alt="Raise that flag">
<p style="text-align: center;">Dominic Tarr, @substack and Julian Gruber raising the NodeConf.eu flag</p>
</div>
<p>NodeConf.eu was held in Waterford, Ireland, on an <strong>Island</strong>, in a <strong>Castle</strong> and was organised by the Node lovin' company, <a href="http://nearform.com/">nearForm</a>, in particular <a href="http://cianomaidin.com/">Cian O'Maidin</a> and his amazing assistant Catherine Bradley. Of course <a href="http://futurealoof.com/">Mikeal Rogers</a> had a significant role in organising the event too.</p>
<div style="text-align: center;">
<img src="https://r.va.gg/images/2013/09/nodeconfeu_castle.jpg" alt="Waterford Castle">
<p style="text-align: center;"><a href="http://waterfordcastle.com/">Waterford Castle</a></p>
</div>
<div style="text-align: center;">
<img src="https://r.va.gg/images/2013/09/nodeconfeu_pig.jpg" alt="Pig">
<p style="text-align: center;">The welcome banquet ... yep</p>
</div>
<p>Instead of describing the talks, I'll defer to the <a href="http://clock.co.uk/tech-blogs/nodeconfeu-2013-part-one">excellent</a> <a href="http://clock.co.uk/tech-blogs/nodeconfeu-2013-part-two">four</a> <a href="http://clock.co.uk/tech-blogs/nodeconfeu-2013-part-three">part</a> <a href="http://clock.co.uk/tech-blogs/nodeconfeu-reflection">series</a> by Paul, Adam, Luke and Ben of <a href="http://clock.co.uk/">Clock</a> where you'll find a great summary of the talks and events of the conference.</p>
<p>For my part, I was deeply honoured to be involved in the <em>"Node Databases"</em> track of the conference. We started off the NodeConf.eu talks with a 3-part show. My talk was titled "A Real Database Rethink" and was followed by <a href="https://twitter.com/dominictarr">Dominic Tarr</a> who talked more about the Level* ecosystem and the various pieces of the Node Databases puzzle that's being built. <a href="http://juliangruber.com/">Julian Gruber</a> then closed us off with some amazing live-coding of some browser/server streaming LevelUP/multilevel <a href="https://github.com/juliangruber/nodeconfeu-13">wizardry</a>.</p>
<h3 id="a-real-database-rethink">A Real Database Rethink</h3>
<p>The slides of my talk are <a href="https://r.va.gg/presentations/nodeconfeu.2013/">online</a>. I attempted to break down the definition of the term <em>"database"</em> by looking at where the concept comes from historically. It's actually a difficult thing to define and I don't believe there is any one agreed upon meaning. What I came up with is:</p>
<blockquote>
<p>A tool for interacting with structured data, externalised from the core of our application</p>
<ul>
<li>Persistence</li>
<li>Performance</li>
<li>Simplify access to complex data</li>
</ul>
<p>And sometimes...</p>
<ul>
<li>Shared access</li>
<li>Scalability</li>
</ul>
</blockquote>
<p>But even that's pretty rough.</p>
<p>Taking that definition, we can apply the Node philosophy of a small core and a vibrant user-land, along with the culture of extreme modularity afforded us by npm, and build a new kind of database; or at least apply new thinking to the "database".</p>
<p>The bulk of my talk was taken up with talking about LevelUP and the basics of the Level* ecosystem. There's a table on slide #7 that I'm going to try and refine over time to help describe what the Level* / NodeBase world is all about.</p>
<h3 id="level-me-up-scotty-">Level Me Up Scotty!</h3>
<p>One of the three workshops available at NodeConf.eu was all about Node Databases. I took the same approach as at <a href="http://campjs.com/">CampJS</a> recently where I built <a href="https://r.va.gg/2013/08/learn-you-the-node.js.html">Learn You The Node.js For Much Win!</a>, a tool that owes a debt to <a href="https://github.com/substack/stream-adventure">stream-adventure</a>, a self-guided workshop-in-your-terminal application by <a href="https://twitter.com/substack">@substack</a> and <a href="https://twitter.com/maxogden">Max Ogden</a> written for NodeConf (US).</p>
<p>This time around, I received some great help from both @substack and Julian Gruber, who helped write some exercises; I also received help from <a href="http://twitter.com/eugeneware">Eugene Ware</a>, who wasn't even at the conference but was assisting with development from Australia. <a href="http://twitter.com/raynos2">Raynos</a> was also a great help in getting the application working well.</p>
<p>We ended up with <strong><em>Level Me Up Scotty!</em></strong>, or just <strong>levelmeup</strong>.</p>
<div style="text-align: center;">
<img src="https://raw.github.com/rvagg/levelmeup/master/levelmeup.png" alt="levelmeup">
</div>
<p>Dominic Tarr, <a href="https://twitter.com/thlorenz">Thorsten Lorenz</a>, <a href="https://twitter.com/hij1nx">Paolo Fragomeni</a>, <a href="http://www.matteocollina.com/">Matteo Collina</a>, <a href="https://twitter.com/ralphtheninja">Magnus Skog</a>, Max Ogden and other experienced <em>Levelers</em> helped on and off while the workshops were happening; so we had plenty of expertise at hand whenever there were questions.</p>
<p>Workshops were unstructured and the organisers of each workshop all ended up agreeing that we should just let people come and go as they pleased. This suited us as the workshop was open-ended and designed not to be finished by most people within the planned hour <em>(I think an hour was the original plan)</em>.</p>
<p><strong><a href="https://github.com/rvagg/levelmeup">levelmeup</a></strong> is installed from npm (<code>npm install levelmeup -g</code>) and is fully self-guided. You run the <code>levelmeup</code> application and it steps you through some exercises designed to:</p>
<ul>
<li>introduce you to the format of the workshops with a simple "Hello World" style exercise</li>
<li>introduce you to LevelUP and its basic operations</li>
<li>help you understand ReadStream and the range-queries it makes possible</li>
<li>encourage creative thought regarding key structure</li>
<li>introduce <a href="https://github.com/dominictarr/level-sublevel">sublevel</a></li>
<li>introduce <a href="https://github.com/juliangruber/multilevel">multilevel</a></li>
</ul>
<p>There's more planned for the future of this workshop application too; Matteo even has a <a href="https://github.com/rvagg/levelmeup/pull/19">work-in-progress exercise</a> that should be merged fairly soon.</p>
<p><strong><a href="http://nodeschool.io/">nodeschool.io</a></strong> was hatched from NodeConf.eu and pulls together the three workshop applications currently available in npm. I believe this was an initiative of <a href="https://twitter.com/brianloveswords">Brian J. Brennan</a> and other Mozillans on the <a href="http://openbadges.org/">Open Badges</a> project. <strong><a href="https://github.com/rvagg/workshopper">workshopper</a></strong> is the engine that runs both learnyounode and levelmeup and we're trying to make it even easier for others to author their own workshop applications. There is already a <a href="https://github.com/timoxley/functional-javascript-workshop/">Functional JavaScript Workshop</a> by <a href="https://twitter.com/secoif">Tim Oxley</a> and there are more in development. Exciting times!</p>
<div style="text-align: center;">
<img src="https://r.va.gg/images/2013/09/nodeconfeu_levelmeup.jpg" alt="Level Me Up Workshoppers">
<p style="text-align: center;">Workshoppers stretching their brains with <strong>levelmeup</strong></p>
</div>
<p>My experience with <strong>stream-adventure</strong> and <strong>learnyounode</strong> suggested that this format should prove to be relatively successful but ultimately I think we had most of the attendees come through at some point and sit down to have a crack at the workshop. This is particularly impressive given that <a href="http://nexxylove.tumblr.com/">Emily Rose</a>, <a href="http://tmpvar.com/">Elijah Insua</a> and Matteo were running a NodeBots workshop which included Arduino and NodeCopter hacking (always popular!). And <a href="https://twitter.com/mrbruning">Max Bruning</a> and <a href="https://twitter.com/tjfontaine">TJ Fontaine</a> were running a Manta / MDB / DTrace / SmartOS-magic workshop and their material was some of my favourite from NodeConf (US) so I'm sure people really enjoyed what they had to present.</p>
<p>Unfortunately I didn't get to attend these other workshops, I also missed out on some skeet!</p>
<div style="text-align: center;">
<img src="https://farm6.staticflickr.com/5338/9726258926_e3ea4a656f_z.jpg" alt="Skeet">
<p style="text-align: center;">Karolina <em>"don't mess with me"</em> Szczur, photo by <a href="http://www.flickr.com/photos/matthewbergman/sets/72157635446400980/">Matthew Bergman</a></p>
</div>
<p>But there was plenty of other <em>experience</em> to be had. It was also fantastic to meet so many people I only knew from IRC / Twitter / GitHub. For someone who lives in regional Australia and doesn't get a chance to socialise much with other nerds, this was a particularly special opportunity.</p>
<div style="text-align: center;">
<img src="https://farm8.staticflickr.com/7392/9783982165_43ca4edef2_z.jpg" alt="Shenanigans">
<p style="text-align: center;">Final night banquet shenanigans with <a href="https://twitter.com/Av1anFlu">Charlie McConnell</a> and @substack ... the napkin hat thing is a story in itself, blame <a href="https://twitter.com/jllord">Jessica Lord</a>, photo by <a href="http://www.flickr.com/photos/matthewbergman/sets/72157635446400980/">Matthew Bergman</a></p>
</div>
<h3 id="the-level-gang">The Level* Gang</h3>
<p>As an aside, NodeConf.eu had the largest concentration of LevelUP contributors and active Level* developers of any event that I'm aware of so far. So we took the opportunity to have our own little meeting. We even took minutes, <a href="https://github.com/karolinaszczur/leveldb.org/blob/master/meetup-nodeland">of sorts</a>.</p>
<p>There has been a long-standing plan to make a Level* / NodeBase website but being the disorganised rabble we are, it hasn't got off the ground. Karolina (and Jessica too I believe) are keen to help out on the design end but just need the content. So that's what we planned. There's a bunch of issues that form a TODO in the <a href="https://github.com/karolinaszczur/leveldb.org/issues">repo</a> for this project. Hopefully we can all get on top of it sooner rather than later. We're also open to assistance from anyone else that would like to contribute.</p>
<p>Besides getting stuff done, it was just a pleasure to hang out with these people and talk <em>shop</em>.</p>
<div style="text-align: center;">
<img src="https://r.va.gg/images/2013/09/nodeconfeu_levelgang.jpg" alt="A momentous event">
<p style="text-align: center;"><strong>The Level* Gang</strong>: Paolo, Dominic, @substack, Karolina, Magnus, Mikeal, Julian, Max, Matteo and <a href="https://twitter.com/paulfryzel">Paul Fryzel</a>. Raynos was around but missed this particular <em>event</em>, Thorsten was inside demoing his guitar-typing software.</p>
</div>
<h2>Learn You The Node.js</h2>
<p><em>2013-08-14 · <a href="https://r.va.gg/2013/08/learn-you-the-node.js.html">https://r.va.gg/2013/08/learn-you-the-node.js.html</a></em></p>
<p><strong><a href="http://campjs.com/">CampJS</a></strong> has just finished, with a bigger crowd than last time around. It was lots of fun, and as usual, these events are more about meeting the people I collaborate and socialise with online than anything else. There was a particularly large turn-out of the hackers on #polyhack, our Australian programmers channel on Freenode. Even <a href="https://twitter.com/mwotton">@mwotton</a>, our resident Haskell-troll, was there! Lots of photos and news can be found on <a href="http://storify.com/campjs/campjs-ii">Storify</a>. The next one will likely be near Melbourne in February some time and I highly recommend it if you can get there.</p>
<h3 id="learn-you-the-node-js-for-much-win-presentation-">Learn You The Node.js For Much Win (presentation)</h3>
<p>I was struck last CampJS how many JavaScript newbies were there, or at least people who deal with JavaScript as a secondary language and therefore only have a cursory understanding of it. And by extension, there were not many people who had much understanding of Node. So I wanted to present some intro-to-Node material this time.</p>
<p>I gave a 30 minute talk covering the very basics of <em>what Node <strong>is</strong></em>, called <strong>Learn You The Node.js For Much Win</strong>. Obviously the title is inspired by <em><a href="http://learnyouahaskell.com/">Learn You a Haskell For Great Good</a></em> and <em><a href="http://learnyousomeerlang.com/">Learn You Some Erlang For Great Good</a></em>. You can find my slides <a href="https://r.va.gg/presentations/campjs-learn-you-node/">here</a> (feel free to rip them off if you need to give a similar talk somewhere!). The video may be online at some point in the future.</p>
<h3 id="learn-you-the-node-js-for-much-win-workshop-">Learn You The Node.js For Much Win (workshop)</h3>
<p><img src="https://pbs.twimg.com/media/BRWaBeeCcAA9R7v.jpg" style="border-radius:4px; border: solid 2px white; box-shadow: 1px 1px 15px rgba(0,0,0,0.4);"></p>
<p>The next morning, I gave a workshop on the same topic but it was much more hands-on. The inspiration for my workshop came from <a href="http://www.nodeconf.com/">NodeConf</a>, a couple of months earlier. <a href="https://twitter.com/substack">@substack</a> and <a href="https://twitter.com/maxogden">@maxogden</a> presented a workshop titled <strong>stream adventure</strong> which was a self-guided, interactive workshop for the terminal, built with Node. You can find it <a href="https://github.com/substack/stream-adventure">here</a> and install it from npm with <code>npm install stream-adventure -g</code>, I highly recommend it.</p>
<p><a href="https://nodei.co/npm/stream-adventure/"><img src="https://nodei.co/npm/stream-adventure.png?downloads=true&stars=true" alt="NPM"></a></p>
<p>I was so inspired that I stole their code and made my own workshop application! <strong><a href="https://github.com/rvagg/learnyounode/">learnyounode</a></strong>. You can download and install it with <code>npm install learnyounode -g</code>.</p>
<p><a href="https://nodei.co/npm/learnyounode/"><img src="https://nodei.co/npm/learnyounode.png?downloads=true&stars=true" alt="NPM"></a></p>
<p>The application itself is/was a series of 13 separate exercises, starting off with a simple <em>HELLO WORLD</em> and ending with a JSON API HTTP server (contributed by the very clever <a href="https://twitter.com/sidorares">@sidorares</a>).</p>
<p><img src="https://raw.github.com/rvagg/learnyounode/master/learnyounode.png" alt="learnyounode"></p>
<p>Nobody actually managed to finish the workshops in the allotted 60 minutes, although <a href="http://twitter.com/alexdickson">@alexdickson</a>, an expert JavaScripter but Node-n00b, was the first one I heard of finishing it, not long after.</p>
<p>The workshops attempt to focus on some of the core concepts of Node. There's lots of console output because that's easiest to validate, but it introduces filesystem I/O, both synchronous and asynchronous, and moves straight on to networking because that's what Node is so good at. An <em>HTTP CLIENT</em> example introduces HTTP and is expanded on in <em>HTTP COLLECT</em>, which introduces streams. <em>JUGGLING ASYNC</em> builds on <em>HTTP COLLECT</em> to introduce the complexities of managing parallel asynchronous activities. From there, it switches from network clients to network servers: first a simple TCP server in <em>TIME SERVER</em>, then using streams to serve files in <em>HTTP FILE SERVER</em> and transforming data with <em>HTTP UPPERCASERER</em>. The final exercise presents you with a more complex, closer-to-real-world example, an HTTP API server with multiple end-points.</p>
<p>The entire workshop is designed to take longer than 1 hour; people ought to be able to take it away and complete it later. It's also designed to be suitable for complete n00bs as well as people with some experience, and it ought to make a fun challenge for anyone already experienced with Node to see how quickly they can complete the examples (I believe I earned the honour of being the first person at NodeConf to finish stream-adventure in the allotted time!).</p>
<p>The Node-experts at CampJS were thankfully helping out during the workshop so there wasn't much competition going on there.</p>
<p>Many thanks to these expert Node nerds who hovered and helped people during the workshop and also did some test-driving of the workshop prior to the event:</p>
<ul>
<li><a href="https://twitter.com/nicholasf">Nicholas Faiz</a></li>
<li><a href="https://twitter.com/cgiffard">Christopher Giffard</a></li>
<li><a href="https://twitter.com/secoif">Tim Oxley</a> (who also poured his heart and soul into organising CampJS)</li>
<li><a href="http://twitter.com/deoxxa">Conrad Pankoff</a></li>
<li><a href="https://twitter.com/sidorares">Andrey Sidorov</a> (who also contributed the final exercise of the workshop)</li>
<li><a href="https://twitter.com/EugeneWare">Eugene Ware</a> (who was also brilliant all weekend, running the local <a href="http://en.wikipedia.org/wiki/Sneakernet">sneakernet</a> because the network was so flakey)</li>
</ul>
<p><em>(I really hope I haven't missed anyone out there; so many quality nerds at CampJS!)</em></p>
<p><img src="https://lh5.googleusercontent.com/-tKp0U1N7XNw/UgngKk01qqI/AAAAAAAAAoc/xxAOCTqMCZ0/w600-h800-no/campJS+%252870+of+118%2529.jpg" style="border-radius:4px; border: solid 2px white; box-shadow: 1px 1px 15px rgba(0,0,0,0.4);"></p>
<p><em>Tim Oxley making a contribution during the workshop, along with Christopher Giffard (left) and Eugene Ware (right)</em></p>
<p>I had the <a href="https://r.va.gg/presentations/campjs-learn-you-node/workshop.html">solutions</a> to the workshop ready on the big-screen and walked through some of the early solutions and talked through what was going on. I didn't expect many people to listen to those bits and the workshop was designed so you could totally zone-out and do it at your own pace if that suited.</p>
<p>If anyone wants to run a similar style workshop for their local meet-up, using the same content, I'd love to receive contributions to <strong>learnyounode</strong>. Alternatively, make your own! I extracted the core framework from <strong>learnyounode</strong> and it now lives separately as <strong><a href="https://github.com/rvagg/workshopper">workshopper</a></strong>.</p>
<p><a href="https://nodei.co/npm/workshopper/"><img src="https://nodei.co/npm/workshopper.png?downloads=true&stars=true" alt="NPM"></a></p>
<p>I would love feedback from anyone in attendance or anyone that uses this tool to run their own workshops! <strong>learnyounode</strong> is already listed in Max Ogden's excellent <strong><a href="https://github.com/maxogden/art-of-node">The Art of Node</a></strong>, so I'm looking forward to contributions to help turn this into a really useful teaching tool.</p>
<h2>LevelDOWN Alternatives</h2>
<p><em>2013-06-07 · <a href="https://r.va.gg/2013/06/leveldown-alternatives.html">https://r.va.gg/2013/06/leveldown-alternatives.html</a></em></p>
<p>Since <strong><a href="https://github.com/rvagg/node-levelup">LevelUP</a></strong> v0.9, <strong><a href="https://github.com/rvagg/node-leveldown/">LevelDOWN</a></strong> has been an optional dependency, allowing you to switch in alternative back-ends.</p>
<p>We have <strong><a href="https://github.com/rvagg/node-memdown">MemDOWN</a></strong>, a pure in-memory data-store, allowing you to run LevelUP against transient, and very fast storage.</p>
<p>We also have <strong><a href="https://github.com/maxogden/level.js">level.js</a></strong> which works against <strong>IndexedDB</strong>, allowing you to run LevelUP in the browser!</p>
<p>Since LevelUP just needs some basic primitives and a sorted bi-directional iterator, we can swap out the back-end with numerous alternatives.</p>
<p>The easy targets are the forks of LevelDB that purport to <em>fix</em> or <em>improve</em> LevelDB in some way. I have another post brewing on what I think about the claims made in this area and how we ought to approach them, but that can come later. For now I have some packages in npm for you to try for yourself!</p>
<h2 id="basho">Basho</h2>
<p>First of all we have <strong>leveldown-basho</strong> which bundles the <a href="https://github.com/basho/leveldb">Basho LevelDB fork</a> into LevelDOWN. See Matthew Von-Maszewski's <a href="https://speakerdeck.com/basho/optimizing-leveldb-for-performance-and-scale-ricon-east-2013">slides</a> from the recent Ricon East 2013 for more information on what they've tried to do with LevelDB.</p>
<p>In summary, Basho's aim is to optimise LevelDB "for the server", particularly for high write throughput. They use more than one compaction thread and relax the rules a little on overlapping keys for the lower levels. Plus a few other things that I won't get into here.</p>
<div class="highlight"><pre><span class="nv">$ </span>npm install levelup leveldown-basho
</pre></div>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">levelup</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'levelup'</span><span class="p">)</span>
<span class="p">,</span> <span class="nx">leveldown</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'leveldown-basho'</span><span class="p">)</span>
<span class="p">,</span> <span class="nx">db</span> <span class="o">=</span> <span class="nx">levelup</span><span class="p">(</span><span class="s1">'/path/to/db'</span><span class="p">,</span> <span class="p">{</span> <span class="nx">db</span><span class="o">:</span> <span class="nx">leveldown</span> <span class="p">})</span>
<span class="c1">// go to work on `db`</span>
</pre></div>
<p><em>Disclaimer: some of the LevelDOWN and LevelUP tests are failing on the current build for this release. I don't believe they should impact standard usage, but your mileage may vary...</em></p>
<h2 id="hyperdex">HyperDex</h2>
<p>Next, we have <strong>leveldown-hyper</strong>, which bundles a fork by the people behind <a href="http://hyperdex.org/">HyperDex</a>, a key-value store. Again their aim is to optimise LevelDB for a server environment. You can see some of their claims about performance <a href="http://hyperdex.org/performance/leveldb/">here</a>. I don't know as much about this fork; I'll investigate further when I have time, but they are also using multiple compaction threads to do the background work.</p>
<div class="highlight"><pre><span class="nv">$ </span>npm install levelup leveldown-hyper
</pre></div>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">levelup</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'levelup'</span><span class="p">)</span>
<span class="p">,</span> <span class="nx">leveldown</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'leveldown-hyper'</span><span class="p">)</span>
<span class="p">,</span> <span class="nx">db</span> <span class="o">=</span> <span class="nx">levelup</span><span class="p">(</span><span class="s1">'/path/to/db'</span><span class="p">,</span> <span class="p">{</span> <span class="nx">db</span><span class="o">:</span> <span class="nx">leveldown</span> <span class="p">})</span>
<span class="c1">// go to work on `db`</span>
</pre></div>
<h2 id="-i-strike-lies-strike-i-benchmarks-"><i><strike>Lies!</strike></i> Benchmarks!</h2>
<p>OK, benchmarks kind of suck, particularly microbenchmarks. It's really hard to test something that's meaningful for everyone's use-case. But you can make pretty pictures with them and they can tell something of a story, even if it's just the first page of a novel.</p>
<p>So here we go. I've put together a simplistic benchmark that tries to test the kind of situation that these two forks are aiming to optimise for. In particular, high-throughput writes. There's a common claim that LevelDB has problems with writes because the compaction thread can hold up levels 0 and 1 while it's working on higher levels, and you really want to be flushing the new data as soon as possible so you can get more in. (Again, I have more to say on this & the claims about "fixing" the problem in a later post.)</p>
<p>I have a sorted-write benchmark in the <a href="https://github.com/rvagg/node-leveldown/tree/master/bench">LevelDOWN repo</a> that tries to push in 10M pre-sorted entries as fast as possible, fully utilising Node's worker-threads for the job. So this isn't your typical browser scenario. An important point here is that <strong>Node is a unique environment when looking at LevelDB performance</strong>. It's not going to be a straightforward mapping of benchmark results obtained with other LevelDB bindings onto what we can achieve in Node.</p>
<p>Because there are so many entries, instead of recording the time for individual writes, I've recorded average time for batches of 1000 writes. Below you can see what the write-times look like when plotted over time. There are a bunch of outliers that are above the maximum Y of 0.6ms, but not enough to warrant distracting from the interesting behaviour below 0.6ms so I chopped it off there.</p>
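<p>The measurement approach can be sketched as follows. The store here is a hypothetical in-memory stand-in (a plain <code>Map</code>), not LevelDOWN, and the entry count is scaled down; the point is only the per-batch averaging used for the plots.</p>

```javascript
// Sketch of the per-batch timing used for the plots: record the average
// write time over every batch of 1000 writes rather than each individual
// write. The `store` below is a hypothetical in-memory stand-in, NOT
// LevelDOWN -- only the measurement technique is the point here.

var store = new Map()

function put (key, value) {
  store.set(key, value)
}

function benchmark (totalEntries, batchSize) {
  var averages = []
  for (var start = 0; start < totalEntries; start += batchSize) {
    var t0 = process.hrtime()
    for (var i = start; i < start + batchSize; i++) {
      put('key' + i, 'value' + i) // pre-sorted keys, as in the real benchmark
    }
    var diff = process.hrtime(t0) // [ seconds, nanoseconds ]
    var ms = diff[0] * 1e3 + diff[1] / 1e6
    averages.push(ms / batchSize) // average ms per write in this batch
  }
  return averages
}

// 100k entries here rather than 10M, purely to keep the sketch quick
var averages = benchmark(100000, 1000)
```

Each element of <code>averages</code> becomes one point on the plots; outliers above the Y cut-off are simply clipped when charting.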
<p><strong>It is important to note that I'm using the default options</strong> here and this is where I'll probably cop some flak. Basho in particular advocate a healthy amount of "tuning" to achieve appropriate performance. In particular the write-buffer defaults to only 4M and you can push data in faster (at the cost of compactions later on) by increasing this. I think the forks may even have additional tunables of their own that you can fiddle with. But, this whole tuning thing is a rabbit hole I don't dare go down right now!</p>
<p>I'm running this on an i7-2630QM CPU, plenty of RAM and an SSD.</p>
<p>You can see that we managed to push in the 10M entries in just over 95 seconds with the plain <strong>Google LevelDB (v1.10.0)</strong>.</p>
<p><img src="https://r.va.gg/images/2013/06/write_sorted_times_g.png" height=500 width=800 align="center" /></p>
<hr>
<p>Next up we have the HyperDex fork. The main difference here is that we have it working slightly faster in total and the write-times have been trimmed down a bit to be more consistent. Not a bad effort with default settings, quite a nice picture.</p>
<p><img src="https://r.va.gg/images/2013/06/write_sorted_times_h.png" height=500 width=800 align="center" /></p>
<hr>
<p>Lastly we can see what Basho have done. They've been on this case for a lot longer than HyperDex have and their fork, internally at least, diverges quite a bit from Google's LevelDB.</p>
<p>We can see that the write-time has been considerably flattened, which is in line with what Basho claim and are aiming for; the consistency here is <strong>very</strong> impressive. Unfortunately we've ended up with a total time that is <strong>double</strong> what it took Google's LevelDB to get the 10M entries in!</p>
<p>This probably has something to do with the tunables, or perhaps I've messed something up; anything's possible!</p>
<p><img src="https://r.va.gg/images/2013/06/write_sorted_times_b.png" height=500 width=800 align="center" /></p>
<hr>
<h2 id="so-">So?</h2>
<p>If you take anything away from this, here's what I think it should be: <strong>Do your own benchmarks if performance <em>really</em> is an issue for you</strong>. You're going to need some kind of benchmark suite that is tailored to your particular application. This will not only let you choose the appropriate storage system but it will give you something to work with when you start to get into the mire that is "tunables".</p>
<p>It's likely I won't be able to leave this alone and will be posting more benchmarks with some tweaking and tuning. I'd love to have input from others on this too of course! The code for this is all in the LevelDOWN repo with both of these forks under appropriately named branches.</p>
LevelUP v0.9 Released2013-05-21T00:00:00.000Zhttps://r.va.gg/2013/05/levelup-v0.9-released.html
<p><img src="https://twimg0-a.akamaihd.net/profile_images/3360574989/92fc472928b444980408147e5e5db2fa_bigger.png" alt="LevelDB"></p>
<p>As per my <a href="https://r.va.gg/2013/05/levelup-v0.9-some-major-changes.html">previous post</a>, <strong><a href="https://github.com/rvagg/node-levelup">LevelUP</a> v0.9 has been released</strong>!</p>
<p>I'm doing a quick post about this release because it's got more changes in it than we normally see, including some things worth explaining.</p>
<h3 id="relationship-to-leveldown">Relationship to LevelDOWN</h3>
<p>The biggest change is the removal of <a href="https://github.com/rvagg/node-leveldown/">LevelDOWN</a> as a dependency; you should <a href="https://r.va.gg/2013/05/levelup-v0.9-some-major-changes.html">review what I've already said about this</a> as it will impact you if you're currently using LevelUP. In short, you'll either need to explicitly <code>npm install leveldown</code> or switch to using the new <a href="https://github.com/level/level">Level</a> package which bundles them both.</p>
<p>Along with this change, we also get better <a href="http://browserify.org/">Browserify</a> support. See <a href="https://github.com/maxogden/level.js">level.js</a> for more information on this.</p>
<h3 id="chained-batch">Chained batch</h3>
<p>The other major change is the introduction of a new <strong>chained batch</strong> syntax, additional to the existing batch syntax. This method of creating and writing batch operations is much closer to the way LevelDB does batches and under certain circumstances you may find improved performance from using this method.</p>
<p>If you call <code>db.batch()</code> with no arguments, you'll get a <code>Batch</code> object back which has the following operations: <code>put()</code>, <code>del()</code>, <code>clear()</code> and <code>write()</code>. The first three are chainable so you can call them one after the other to build your batch. <code>write()</code> is the only method that takes a callback because it submits the batch. Until you call <code>write()</code>, the batch is transient and can be discarded.</p>
<p>Example from the <a href="https://github.com/rvagg/node-levelup#readme">README</a>:</p>
<div class="highlight"><pre><span class="nx">db</span><span class="p">.</span><span class="nx">batch</span><span class="p">()</span>
<span class="p">.</span><span class="nx">del</span><span class="p">(</span><span class="s1">'father'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'name'</span><span class="p">,</span> <span class="s1">'Yuri Irsenovich Kim'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'dob'</span><span class="p">,</span> <span class="s1">'16 February 1941'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'spouse'</span><span class="p">,</span> <span class="s1">'Kim Young-sook'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'occupation'</span><span class="p">,</span> <span class="s1">'Clown'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">write</span><span class="p">(</span><span class="kd">function</span> <span class="p">()</span> <span class="p">{</span> <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s1">'Done!'</span><span class="p">)</span> <span class="p">})</span>
</pre></div>
<h3 id="some-love-for-writestream">Some love for WriteStream</h3>
<p>WriteStream got some attention in this release. On the main <code>createWriteStream()</code> method and on individual <code>write()</code> calls, you can now pass some new options:</p>
<ul>
<li><code>'type'</code> can switch from the default <code>'put'</code> to <code>'del'</code> so you can make a WriteStream that only deletes when you <code>write({ key: 'foo' })</code>, or you can make individual writes delete: <code>write({ type: 'del', key: 'foo' })</code>.</li>
<li><code>'keyEncoding'</code> and <code>'valueEncoding'</code> will switch from default encodings for the current LevelUP instance. Again, you can specify them on the main <code>createWriteStream()</code> or on individual <code>write()</code> calls.</li>
</ul>
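<p>To make those semantics concrete, here's a sketch of how the <code>'type'</code> option resolves per write. The <code>applyWrite()</code> function and <code>Map</code>-backed store are illustrative stand-ins, not LevelUP's actual WriteStream implementation.</p>

```javascript
// Sketch of the 'type' option semantics described above: each entry written
// to the stream is either a put or a del, with a per-write 'type' overriding
// the stream-wide default. The Map-backed store and applyWrite() here are
// stand-ins, NOT LevelUP's real WriteStream internals.

var store = new Map()

function applyWrite (entry, defaultType) {
  var type = entry.type || defaultType || 'put' // 'put' is the overall default
  if (type === 'del') store.delete(entry.key)
  else store.set(entry.key, entry.value)
}

// plain stream, default 'put':
applyWrite({ key: 'foo', value: 'bar' })
// per-write delete, as with write({ type: 'del', key: 'foo' }):
applyWrite({ type: 'del', key: 'foo' })
// a stream created with type 'del' deletes on a plain write({ key: ... }):
applyWrite({ key: 'baz', value: 'x' })
applyWrite({ key: 'baz' }, 'del')
```

A real WriteStream also buffers and batches writes under the hood, but the type resolution follows the order shown: per-write option first, then the stream-wide default.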
<h3 id="other-changes">Other changes</h3>
<ul>
<li>A <a href="https://github.com/rvagg/node-levelup/pull/128">race condition</a> was fixed that allowed a <code>put()</code> to write to the store before an iterator was obtained when calling <code>createReadStream()</code>.</li>
<li>ReadStream no longer emits a <code>'ready'</code> event.</li>
<li>The <code>db</code> property on LevelUP instances can be used to get access to LevelDOWN or whatever LevelDOWN-substitute you are using (this was previously <code>_db</code>).</li>
<li>Some very LevelDB-specific methods have been deprecated on LevelUP and the documentation now recommends either directly using LevelDOWN or calling via the <code>db</code> property. Specifically:<ul>
<li><code>db.db.approximateSize()</code></li>
<li><code>leveldown.repair()</code></li>
<li><code>leveldown.destroy()</code></li>
</ul>
</li>
<li>LevelDOWN got a new LevelDB method: <code>getProperty()</code> that currently understands 3 properties:<ul>
<li><code>db.db.getProperty('leveldb.num-files-at-levelN')</code>: returns the number of files at level <em>N</em>, where <em>N</em> is an integer representing a valid level (e.g. "0").</li>
<li><code>db.db.getProperty('leveldb.stats')</code>: returns a multi-line string describing statistics about LevelDB's internal operation.</li>
<li><code>db.db.getProperty('leveldb.sstables')</code>: returns a multi-line string describing all of the <em>sstables</em> that make up the contents of the current database.</li>
</ul>
</li>
<li>Significantly improved ReadStream performance (up to 50% faster).</li>
<li>Some LevelDOWN memory leaks were discovered and fixed.</li>
<li>LevelDOWN upgraded to LevelDB@1.10.0, <a href="https://groups.google.com/forum/#!topic/node-levelup/bly-MiUzrZw">details here</a>.</li>
</ul>
<h3 id="who-you-should-thank">Who you should thank</h3>
<p>A lot of people put work into this release. There's a <a href="https://github.com/rvagg/node-levelup#contributors">team of people</a> that can claim ownership of LevelUP, LevelDOWN and related projects and most of them have been involved in this release. You should follow these people on Twitter and GitHub!</p>
<ul>
<li><strong>Dominic Tarr</strong> (<a href="https://github.com/dominictarr">GitHub/dominictarr</a> / <a href="http://twitter.com/dominictarr">Twitter/@dominictarr</a>) contributed to the ReadStream fixes and is just a generally valuable & awesome sage in the LevelDB + Node community.</li>
<li><strong>Julian Gruber</strong> (<a href="https://github.com/juliangruber">GitHub/juliangruber</a> / <a href="http://twitter.com/juliangruber">Twitter/@juliangruber</a>) contributed the encoding options for WriteStreams and most of the work on the new chained <code>batch()</code>.</li>
<li><strong>Matteo Collina</strong> (<a href="https://github.com/mcollina">GitHub/mcollina</a> / <a href="https://twitter.com/matteocollina">Twitter/@matteocollina</a>) contributed the <code>'type'</code> options for WriteStreams and most of the work on performance improvements to ReadStreams.</li>
<li><strong>David Björklund</strong> (<a href="https://github.com/kesla">GitHub/kesla</a> / <a href="http://twitter.com/david_bjorklund">Twitter/@david_bjorklund</a>) also contributed work on ReadStream performance.</li>
<li><strong>Max Ogden</strong> (<a href="https://github.com/maxogden">GitHub/maxogden</a> / <a href="http://twitter.com/maxogden">Twitter/@maxogden</a>) and <strong>Anton Whalley</strong> (<a href="https://github.com/No9">GitHub/No9</a> / <a href="https://twitter.com/antonwhalley">Twitter/@antonwhalley</a>) both worked on extracting most of the LevelDOWN test suite into <a href="https://github.com/rvagg/node-abstract-leveldown">AbstractLevelDOWN</a> to form a LevelDOWN-spec that's also runnable in browser environments.</li>
</ul>
<p>And others, who you can find in <a href="https://github.com/rvagg/node-levelup/pull/129">this 0.9 WIP thread</a>, plus additional users who found & reported issues.</p>
LevelUP v0.9 - Some Major Changes2013-05-20T00:00:00.000Zhttps://r.va.gg/2013/05/levelup-v0.9-some-major-changes.html
<p><img src="https://twimg0-a.akamaihd.net/profile_images/3360574989/92fc472928b444980408147e5e5db2fa_bigger.png" alt="LevelDB"></p>
<p><a href="https://github.com/rvagg/node-levelup">LevelUP</a> is still quite young and bound to go through some major shifts. It's best to not be too tied to immature APIs early in a project's lifetime.</p>
<p>That said, we're very interested in stability so we try to keep breaking changes to a minimum. However, we're about to publish version 0.9 and there's one change that's not exactly a "breaking" change in the normal sense, but it is something that I need to explain because it will impact almost everyone currently using LevelUP.</p>
<h3 id="severing-the-dependency-on-leveldown">Severing the dependency on LevelDOWN</h3>
<p>LevelUP depends on <a href="https://github.com/rvagg/node-leveldown/">LevelDOWN</a> to do its <em>LevelDB thing</em>. LevelDOWN was once part of LevelUP until we split it off to a discrete project that focuses entirely on acting as a direct C++ bridge between LevelDB and Node. We get to focus on making LevelUP an awesome LevelDB-ish interface without being tied directly to LevelDB implementation details (e.g. Iterators vs Streams).</p>
<p>In fact, a new project was spawned to define the LevelDOWN interface that LevelUP requires. <a href="https://github.com/rvagg/node-abstract-leveldown">AbstractLevelDOWN</a> is a set of strict tests for the functionality that LevelUP uses and it also implements a basic abstract shell that can be extended to create additional back-ends for LevelUP.</p>
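<p>To illustrate the kind of back-end this makes possible, here's a toy in-memory store exposing a minimal async <code>put()</code>/<code>get()</code>/<code>del()</code> shape. This is a sketch only: it extends no real base class and invents nothing from AbstractLevelDOWN's actual API. A real back-end extends AbstractLevelDOWN and passes its test suite (MemDOWN, listed below, is the real thing).</p>

```javascript
// Toy illustration of a pluggable back-end: an in-memory store with the
// minimal asynchronous put/get/del shape that a LevelUP-style consumer
// expects. Illustrative ONLY -- it does not extend the real
// AbstractLevelDOWN class; see MemDOWN for a genuine implementation.

function MemStore () {
  this._store = new Map()
}

MemStore.prototype.put = function (key, value, callback) {
  this._store.set(String(key), value)
  process.nextTick(callback) // stay asynchronous, like a real back-end
}

MemStore.prototype.get = function (key, callback) {
  var self = this
  var k = String(key)
  process.nextTick(function () {
    if (!self._store.has(k)) return callback(new Error('NotFound'))
    callback(null, self._store.get(k))
  })
}

MemStore.prototype.del = function (key, callback) {
  this._store.delete(String(key))
  process.nextTick(callback)
}

// usage, mirroring the LevelUP callback style:
var db = new MemStore()
db.put('name', 'LevelUP', function () {
  db.get('name', function (err, value) {
    // value is the stored string
  })
})
```

The value of the AbstractLevelDOWN test suite is exactly that it pins down these contracts (asynchrony, error behaviour, key coercion) so that back-ends are interchangeable.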
<p>So far, there are 3 projects worth mentioning that extend AbstractLevelDOWN:</p>
<ul>
<li><p><strong><a href="https://github.com/maxogden/level.js">level.js</a></strong> operates on top of <a href="https://developer.mozilla.org/en-US/docs/IndexedDB">IndexedDB</a> (which is in turn implemented on top of <a href="https://code.google.com/p/leveldb/">LevelDB</a> in Chrome!).</p>
</li>
<li><p><strong><a href="https://github.com/No9/node-leveldown-gap">leveldown-gap</a></strong> is another browser implementation that uses <a href="https://developer.mozilla.org/en-US/docs/Web/Guide/DOM/Storage#localStorage">localStorage</a> and is designed to be able to work in <a href="http://phonegap.com/">PhoneGap</a> applications.</p>
</li>
<li><p><strong><a href="https://github.com/rvagg/node-memdown">MemDOWN</a></strong> is a pure in-memory implementation that doesn't touch the disk. It's obviously not good for persistent data but sometimes that's not what you need.</p>
</li>
</ul>
<p>Plus some other efforts to adapt other embedded and non-embedded data stores to the LevelDOWN interface. Additionally, there are other versions of LevelDB that can be used, including the fork that <a href="http://basho.com/">Basho</a> maintains for use in <a href="http://basho.com/riak/">Riak</a>. (I have a branch of LevelDOWN that uses this version of LevelDB that I'll release as soon as I can explain and demonstrate the performance differences to vanilla LevelDB for Node users).</p>
<p>In short, LevelUP doesn't <em>need</em> LevelDOWN in the way it once did and LevelUP is turning into a more generic interface to sorted key/value storage systems, albeit with a distinct LevelDB-flavour.</p>
<p>Since version 0.8 we've supported a <code>'db'</code> option when you create a LevelUP instance. This option can be used to provide an alternative LevelDOWN-compatible back-end. Unfortunately, LevelDOWN being defined as a strict dependency of LevelUP means that each time you install it you have to compile LevelDOWN, even if you don't want it. So, we've removed it as a dependency but it's still <em>wired up</em> so that the only thing you need to do is install LevelDOWN alongside LevelUP and it'll take care of the rest.</p>
<div class="highlight"><pre><span class="nv">$ </span>npm install levelup leveldown
</pre></div>
<p>From version 0.9 onwards, you'll need to do this, or you'll see an (informative) error.</p>
<h3 id="introducing-level-">Introducing "Level"</h3>
<p>To make life easier, we're publishing an additional package to npm that bundles both LevelUP and LevelDOWN as dependencies and exposes LevelUP directly. The <strong><a href="https://github.com/level/level">Level</a></strong> package is a very simple wrapper that exists purely as a convenience. It'll track the same versioning as LevelUP so it's a straight substitution.</p>
<div class="highlight"><pre><span class="nv">$ </span>npm install level
</pre></div>
<p>You can simply change your <code>"dependencies"</code> from <strong>"levelup"</strong> to <strong>"level"</strong>, plus you can use it just like LevelUP:</p>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">levelup</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'level'</span><span class="p">)</span>
<span class="kd">var</span> <span class="nx">db</span> <span class="o">=</span> <span class="nx">levelup</span><span class="p">(</span><span class="s1">'./my.db'</span><span class="p">)</span>
<span class="nx">db</span><span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'yay!'</span><span class="p">,</span> <span class="s1">'it works!'</span><span class="p">)</span>
</pre></div>
<h3 id="switching-things-up">Switching things up</h3>
<p>Now we have a properly pluggable back-end, expect to see a growing array of choice and innovation. The most exciting space at the moment is browser-land. Consider <strong>level.js</strong>:</p>
<div class="highlight"><pre><span class="kd">var</span> <span class="nx">levelup</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'levelup'</span><span class="p">)</span>
<span class="p">,</span> <span class="nx">leveljs</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="s1">'level-js'</span><span class="p">)</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">db</span> <span class="o">=</span> <span class="nx">levelup</span><span class="p">(</span><span class="s1">'foo'</span><span class="p">,</span> <span class="p">{</span> <span class="nx">db</span><span class="o">:</span> <span class="nx">leveljs</span> <span class="p">})</span>
<span class="nx">db</span><span class="p">.</span><span class="nx">put</span><span class="p">(</span><span class="s1">'name'</span><span class="p">,</span> <span class="s1">'LevelUP string'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">err</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">db</span><span class="p">.</span><span class="nx">get</span><span class="p">(</span><span class="s1">'name'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">err</span><span class="p">,</span> <span class="nx">value</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s1">'name='</span> <span class="o">+</span> <span class="nx">value</span><span class="p">)</span>
<span class="p">})</span>
<span class="p">})</span>
</pre></div>
<p>Yep, that's browser code. Simply <code>npm install levelup level-js</code> and run the module through <a href="http://browserify.org/">Browserify</a> and you get the full LevelUP API in your browser!</p>
<hr>
<p>Stay tuned! This is just one step in the quest for a truly modular database system that lets you build a database that suits your applications and not the other way around.</p>
Node.ninjas Presentation - LevelDB and Node Sitting in a Tree2013-05-09T00:00:00.000Zhttps://r.va.gg/2013/05/node.ninjas-presentation-leveldb-and-node-sitting-in-a-tree.html
<p>I'm giving a presentation at <a href="http://www.meetup.com/sydney-node-ninjas/">Node.ninjas</a> tonight in Sydney. I've put together a talk about LevelDB and Node that covers:</p>
<ol>
<li>What LevelDB <em>is</em> and the basics of how it works</li>
<li>A quick introduction to the core LevelDB libraries in Node: <a href="https://github.com/rvagg/node-levelup">LevelUP</a> and <a href="https://github.com/rvagg/node-leveldown/">LevelDOWN</a></li>
<li>Some preaching about the awesomeness of modularity around a small, extensible core; including a whirlwind tour of the current, flourishing, LevelDB+Node ecosystem</li>
</ol>
<p>It's this last point that excites me the most. There are some very smart people building some very clever pieces of the <em>Node Database</em> puzzle. What's more, people are actually building functional databases in Node now; I've just collected a list from npm of what look like functional databases that use LevelDB:</p>
<ul>
<li>Rumours</li>
<li>LevelGraph</li>
<li>PushDB</li>
<li>NeutrinoDB</li>
<li>PlumbDB</li>
<li>Syncstore</li>
</ul>
<p>And a few more that look like a work in progress. Plus, I'm sure there's more people out there we've never even heard of who are cooking up some amazing things using the LevelDB+Node combination!</p>
<p><strong>The slides to my talk are <a href="https://r.va.gg/presentations/node.ninjas/">here</a>.</strong></p>
LevelDB and Node: Getting Up and Running2013-05-04T00:00:00.000Zhttps://r.va.gg/2013/05/leveldb-and-node-getting-up-and-running.html
<p>This is the second article in a three-part series on LevelDB and how it can be used in Node.</p>
<ul class="parts">
<li><a href="http://dailyjs.com/2013/04/19/leveldb-and-node-1/">Part 1: What is LevelDB Anyway?</a></li>
<li><a href="http://dailyjs.com/2013/05/03/leveldb-and-node-2/"><strong>Part 2: Getting Up and Running</strong></a></li>
</ul>
<p>Our first article covered the basics of LevelDB and its internals. If you haven't already read it you are encouraged to do so as we will be building upon this knowledge as we introduce the Node interface in this article.</p>
<p><img src="http://dailyjs.com/images/posts/leveldb.png" alt="LevelDB"></p>
<p>There are two primary libraries for using LevelDB in Node, <strong><a href="https://github.com/rvagg/node-leveldown">LevelDOWN</a></strong> and <strong><a href="https://github.com/rvagg/node-levelup">LevelUP</a></strong>.</p>
<p><strong>LevelDOWN</strong> is a pure C++ interface between Node.js and LevelDB. Its API provides limited <em>sugar</em> and is mostly a straightforward mapping of LevelDB's operations into JavaScript. All I/O operations in LevelDOWN are asynchronous and take advantage of LevelDB's thread-safe nature to parallelise reads and writes.</p>
<p><strong>LevelUP</strong> is the library that the majority of people will use to interface with LevelDB in Node. It wraps LevelDOWN to provide a more Node.js-style interface. Its API provides more <em>sugar</em> than LevelDOWN, with features such as optional arguments.</p>
<p>LevelUP exposes iterators as Node.js-style object streams. A LevelUP <strong>ReadStream</strong> can be used to read sequential entries, forward or reverse, to and from any key.</p>
<p>LevelUP handles JSON and other encoding types for you. For example, when operating on a LevelUP instance with JSON value-encoding, you simply pass in your objects for writes and they are serialised for you. Likewise, when you read them, they are deserialised and passed back in their original form.</p>
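<p>Under the hood, a JSON value-encoding is essentially a stringify/parse round-trip on either side of the store. A minimal sketch of that behaviour (stand-in functions over a plain <code>Map</code>, not LevelUP's actual internals):</p>

```javascript
// Sketch of what a JSON value-encoding buys you: objects are serialised on
// write and deserialised back into objects on read. These stand-in functions
// mirror the behaviour described above; they are NOT LevelUP's internals.

var store = new Map()

function putJSON (key, value) {
  store.set(key, JSON.stringify(value)) // stored as a plain string
}

function getJSON (key) {
  var raw = store.get(key)
  return raw === undefined ? undefined : JSON.parse(raw)
}

putJSON('user:1', { name: 'Rod', tags: ['node', 'leveldb'] })
var user = getJSON('user:1')
// user is a real object again, not a string
```

With LevelUP you get this transparently by passing a JSON value-encoding option at construction time, rather than wrapping every call yourself.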
<p><strong>Continue reading this article on <a href="http://dailyjs.com/2013/05/03/leveldb-and-node-2/">DailyJS.com</a></strong></p>
LevelDB and Node: What is LevelDB Anyway?2013-05-01T06:30:00.000Zhttps://r.va.gg/2013/05/leveldb-and-node-what-is-leveldb-anyway.html
<p>This is the first article in a three-part series on LevelDB and how it can be used in Node.</p>
<p>This article will cover the LevelDB basics and internals to provide a foundation for the next two articles. The second and third articles will cover the core LevelDB Node libraries: <a href="https://github.com/rvagg/node-levelup">LevelUP</a>, <a href="https://github.com/rvagg/node-leveldown">LevelDOWN</a> and the rest of the LevelDB ecosystem that's appearing in Node-land.</p>
<p><img src="http://dailyjs.com/images/posts/leveldb.png" alt="LevelDB"></p>
<h3 id="what-is-leveldb-">What is LevelDB?</h3>
<p>LevelDB is an <em>open-source</em>, <em>dependency-free</em>, <em>embedded key/value data store</em>. It was developed in 2011 by Jeff Dean and Sanjay Ghemawat, researchers from Google. It's written in C++ although it has third-party bindings for most common programming languages. Including JavaScript / Node.js of course.</p>
<p>LevelDB is based on ideas in Google's BigTable but does not share code with BigTable, which allows it to be licensed for open source release. Dean and Ghemawat developed LevelDB as a replacement for SQLite as the backing-store for Chrome's IndexedDB implementation.</p>
<p>It has since seen very wide adoption across the industry, serves as the back-end to a number of new databases, and is now the recommended storage back-end for Riak.</p>
<p><strong>Continue reading this article on <a href="http://dailyjs.com/2013/04/19/leveldb-and-node-1/">DailyJS.com</a></strong></p>
Node.js Dublin Presentation - LevelDB2013-05-01T06:00:00.000Zhttps://r.va.gg/2013/05/node.js-dublin-presentation-leveldb.html
<p>I visited lovely Dublin last month to attend <a href="http://peerconf.com/">PeerConf</a>. While there I got to meet a great bunch of Irish programmers at <a href="http://www.nodejsdublin.com/">Node.js Dublin</a>, a semi-regular Node.js meet-up that happens in the <a href="https://www.engineyard.com/">Engine Yard</a> office in Dublin.</p>
<p>I was invited to give a presentation on LevelDB and the work that I've been doing on it in Node.js. I was followed by <a href="https://github.com/dominictarr">Dominic Tarr</a> who's doing some amazing work on top of LevelDB.</p>
<p>You can view my slides <a href="https://r.va.gg/presentations/nodejsdub/">here</a> but a written version is currently being spread over 3 parts on <a href="http://dailyjs.com">DailyJS</a>. More about that soon!</p>
Announcing Bean v1.0.02012-09-08T12:41:12.000Zhttps://r.va.gg/2012/09/bean-v1.html
<p>In my <a href="http://rod.vagg.org/2012/08/bean_v1/" title="Towards Bean v1.0 (or: How event managers do their thing)">previous post</a> about Bean I discussed in detail the work that has gone in to a v1 release and how it will differ from the v0.4 branch.</p>
<p>Bean version 1.0.0 has now been released, you can download it from the <a href="https://github.com/fat/bean">GitHub repository</a> or you can fetch it from <a href="https://npmjs.org/package/bean">npm</a> for your Ender builds.</p>
<p>Here's a quick summary of the changes, but for a more in-depth look you should refer to my <a href="http://rod.vagg.org/2012/08/bean_v1/" title="Towards Bean v1.0 (or: How event managers do their thing)">previous post</a>.</p>
<blockquote>
<p><b><code>on()</code> argument ordering</b>: the new signature is now <code>.on(events[, selector], handlerFn)</code>, which will work on both Bean as a standalone library and when bundled in Ender. In Ender, the following aliases also pass through <code>on()</code> so the same arguments work: <code>addListener()</code>, <code>bind()</code>, <code>listen()</code> and <code>one()</code> (which of course will only trigger once). Plus all the specific shortcuts such as <code>click()</code>, <code>keyup()</code> etc. although these methods have the first argument hardwired.</p>
<p><code>add()</code> is left intact with the same argument ordering for standalone Bean and <code>delegate()</code> has the same signature, the same as jQuery's equivalent.</p>
<p><b><code>off()</code> is the new <code>remove()</code></b>: although <code>remove()</code> is still available in standalone Bean.</p>
<p><b>Bean attaches a single handler to the DOM for each event type on each element</b>: as outlined above, Bean will iterate over all handlers for each triggered event and (mostly) reuse the same Event object for each call.</p>
<p><b><code>Event.stopImmediatePropagation()</code>:</b> is available across all supported browsers, it will stop the processing of all handlers for the current event at the current element (i.e. the event will still bubble).</p>
<p><b>The selector engine argument to <code>add()</code> is now completely removed</b>: you used to have to pass a selector engine in as the last argument for delegated events. Now you must set it once at start-up with <code>setSelectorEngine()</code>. This is automatically taken care of for you in an Ender build.</p>
<p><b>A duplicate-handler check is no longer performed when you <i>add</i></b>: performance testing showed that this was a massive slow-down and is simply not something that Bean should be responsible for. If you want to add the same handler twice then that's your business and responsibility.</p>
<p><b>Namespace matching for event <code>fire()</code>ing now matches namespaces using an <i>and</i> instead of an <i>or</i>:</b> so for example, firing namespaces 'a.b' will fire any event with <i>both</i> 'a' and 'b' rather than <i>either</i> 'a' or 'b'. This is compatible with jQuery and is arguably a much more sensible and helpful way to deal with namespaces. You can find some discussion on this <a href="https://github.com/fat/bean/pull/68">on GitHub</a>.</p>
<p><b>Lots of internal improvements for speed, code size, etc.</b>.</p>
</blockquote>
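<p>The new <i>and</i> rule for namespace matching is easy to express in code. A sketch of the matching logic (illustrative only, not Bean's internals):</p>

```javascript
// Sketch of the namespace rule described above: firing 'a.b' matches a
// handler only if the handler carries BOTH namespaces. Illustrative only,
// not Bean's actual implementation.

function matchesAll (handlerNamespaces, firedNamespaces) {
  return firedNamespaces.every(function (ns) {
    return handlerNamespaces.indexOf(ns) !== -1
  })
}

// a handler registered as 'click.a.b' has namespaces ['a', 'b']
matchesAll(['a', 'b'], ['a', 'b']) // fires: handler has both 'a' and 'b'
matchesAll(['a'], ['a', 'b'])      // does not fire: handler lacks 'b'
```

Under the old <i>or</i> rule the second case would also have matched, since the handler carries at least one of the fired namespaces.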
<p>There was one remaining question to be resolved—whether <code>Event.stop()</code> would also trigger <code>Event.stopImmediatePropagation()</code>. I've decided to <b>not</b> include it and leave it to the user to decide whether they want to prevent triggering of other listeners on the same event/element.</p>
<p>And that's it! Please give it a spin and open an issue on GitHub if you have any bugs to report or questions to be answered.</p>
How Ender bundles libraries for the browser2012-08-24T03:21:38.000Zhttps://r.va.gg/2012/08/ender-bundling.html
<p>I was asked an interesting Ender question on IRC (#enderjs on Freenode) and as I was answering it, it occurred to me that the subject would be an ideal way to explain how Ender's multi-library bundling works. So here is that explanation!</p><p>The original question went something like this:</p><blockquote><p>When a browser first visits my page, they only get served Bonzo (a DOM manipulation library) as a stand-alone library, but on returning visits they are also served Qwery (a selector engine), Bean (an event manager) and a few other modules in an Ender build. Can I integrate Bonzo into the Ender build on the browser for repeat visitors?</p></blockquote><h3>Wait, what's Ender?</h3><p>Let's step back a bit and start with some basics. The way I generally explain Ender to people is that it's two different things:</p><ol><li>It's a build tool, for bundling JavaScript libraries together into a single file. The resulting file constitutes a new "framework" based around the jQuery-style DOM element collection pattern: <code>$('selector').method()</code>. The constituent libraries provide the functionality for the <em>methods</em> and may also provide the selector engine functionality.</li><li>It's an <em>ecosystem</em> of JavaScript libraries. Ender promotes a small collection of libraries as a base, called <strong>The Jeesh</strong>, which together provide a large portion of the functionality normally required of a JavaScript framework, but there are many more libraries compatible with Ender that add extra functionality. Many of the libraries available for Ender are also usable outside of Ender as stand-alone libraries.</li></ol><p><em><strong>Continue reading this article on <a href="http://dailyjs.com/2012/08/23/ender-tutorial/">DailyJS.com</a></strong></em></p>
Towards Bean v1.0 (or: How event managers do their thing)2012-08-10T13:20:56.000Zhttps://r.va.gg/2012/08/bean_v1.html
<p><b><a href="https://github.com/fat/bean">Bean</a></b> is the event manager included in <b><a href="http://ender.no.de/">Ender's</a></b> starter pack, <i>The Jeesh</i>. If you want to do jQuery-style <code>bind()</code>, <code>on()</code> etc. with Ender, then use Bean.</p>
<p>At the time of writing, we're on version <i>0.4.11</i>. There's also been a <i>0.5-wip</i> ("work in progress") branch for a while now that's included some improvements I've been holding off for a major release. I also put together a <a href="https://github.com/fat/bean/issues/milestones">0.5 milestone</a> on GitHub with some ideas. The major item impacting on the external API is a switch to the <code>on()</code> argument order found in <a href="http://api.prototypejs.org/dom/Event/on/">Prototype</a>, <a href="http://api.jquery.com/on/">jQuery</a> and <a href="https://github.com/madrobby/zepto/blob/753e80114f0618bd7ce865508e0ff2085d0bfb5f/src/event.js#L166">Zepto</a>. Considering the significance of the changes in the new branch, I think that perhaps a <b>1.0</b> release would be warranted.</p>
<h3>Delegated <code>on()</code> argument ordering</h3>
<p>Until now, Bean's <code>add()</code> has followed the same argument ordering as jQuery's <code><a href="http://api.jquery.com/bind/">bind()</a></code> for standard events, and <code><a href="http://api.jquery.com/delegate/">delegate()</a></code> for delegated events; so the signature looks something like this: <code>.add([selector, ]events, handlerFn)</code> (<code>on()</code> exists in the Ender bridge and does the same thing). The proposal was to change this to match the arguably more sensible ordering used by the other major libraries: <code>.on(events[, selector], handlerFn)</code>. This is now in the <i>0.5-wip</i> branch.</p>
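<p>The overloaded middle argument can be resolved by inspecting its type. Here's a minimal sketch of how an event manager might normalise the new signature (this is hypothetical illustration, not Bean's actual code):</p>

```javascript
// Hypothetical sketch (not Bean's actual code) of normalising the
// overloaded signature: on(events, fn) or on(events, selector, fn).
function normaliseOnArgs(events, selectorOrFn, maybeFn) {
  if (typeof selectorOrFn === 'function') {
    // Two-argument form: no delegation selector supplied
    return { events: events, selector: null, fn: selectorOrFn };
  }
  // Three-argument form: the delegation selector sits in the middle
  return { events: events, selector: selectorOrFn, fn: maybeFn };
}
```

<p>With this in place, both <code>on('click', fn)</code> and <code>on('click', 'a.external', fn)</code> resolve to the same internal shape.</p>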
<h3>Performance</h3>
<p>Speed was another issue that I wanted to address for a new major release. Benchmarks have shown that Bean is under-performing in some areas and I believed it could do better. The process of analysing and addressing Bean's performance has been quite instructive and I've narrowed it down to some key trade-offs that authors of event libraries have to deal with. One of the reasons I wanted to write this post was to outline some of these and solicit feedback from the wider Bean-using community.</p>
<h4>Performance trade-off #1: record keeping</h4>
<p>When you call <code>Element.attachEvent()</code> (IE8 and below) or <code>Element.addEventListener()</code> (new browsers) you pass in a handler function that's called when the event in question is triggered. To stop that function being triggered you have to call <code>Element.detachEvent()</code> or <code>Element.removeEventListener()</code> and pass in that same function so the browser knows which handler you want to remove. Event managers like Bean and jQuery make that easier so you can do things like <code>bean.remove(element, 'click')</code> to remove all handlers; but Bean needs to know which handlers it needs to remove so it must keep records. The biggest change back in v0.4 of Bean was a switch to an internal registry that didn't molest DOM elements, external objects or external functions to attach identifiers so they could be later recalled. Previously, a <code>__uid</code> property was set on each DOM element that you set a handler on and your handler function itself had a <code>__uid</code> property set on it. jQuery does this too: it has a global <code>jQuery.guid</code> integer that it increments and attaches to pretty much everything. Don't be surprised when you find a <code>guid</code> property on your object/function/element once jQuery has got its fingers on it. This type of record keeping is fast and easy, but molesting other people's objects isn't very cool and there are alternatives.</p>
<p>My first major contribution to Bean was to switch it over to a registry similar to the one Diego Perini has implemented in <a href="https://github.com/dperini/nwevents/">NWEvents</a>. Bean now iterates and compares rather than looking up directly. It adds some overhead but I managed to squeeze in enough performance gains in other areas to make v0.4 generally faster than v0.3 even with the registry switch.</p>
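<p>The registry approach can be sketched roughly like this (a deliberate simplification, not Bean's or NWEvents' actual code): entries live in an internal list, nothing is written onto your elements or functions, and removal iterates and compares instead of looking up an id:</p>

```javascript
// Simplified registry sketch: no ids are attached to elements or
// handler functions, so removal must iterate and compare entries.
function Registry() { this.entries = []; }

Registry.prototype.add = function (element, type, fn) {
  this.entries.push({ element: element, type: type, fn: fn });
};

// Remove entries matching element/type and, optionally, a specific fn
Registry.prototype.remove = function (element, type, fn) {
  this.entries = this.entries.filter(function (e) {
    return !(e.element === element && e.type === type && (!fn || e.fn === fn));
  });
};

Registry.prototype.handlers = function (element, type) {
  return this.entries.filter(function (e) {
    return e.element === element && e.type === type;
  }).map(function (e) { return e.fn; });
};
```

<p>The cost is the linear scan on each lookup; the benefit is that other people's objects stay untouched.</p>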
<h4>Performance trade-off #2: synthesising the Event object</h4>
<p>The DOM Level 3 Events specification outlines a base <a href="http://www.w3.org/TR/DOM-Level-3-Events/#interface-Event">Event object interface</a>, along with specific event types that extend this and add extra attributes and methods. This is the object that you get when your event handler is triggered by the DOM, it's the object that you read <code>keyCode</code> from for keyboard events and the object that you call <code>preventDefault()</code> and <code>stopPropagation()</code> on.</p>
<p>The problem we have is that nobody actually implements the full spec as-is and we also have to deal with older browsers which have all sorts of interesting attributes and methods on their Event objects. The stand-out difference is that in IE8 and below, instead of calling <code>Event.preventDefault()</code> to prevent the default browser behaviour (e.g. following a link click or accepting a keypress), you have to <code>Event.returnValue = false</code>. And, instead of calling <code>Event.stopPropagation()</code> to stop the event from bubbling up the DOM to parent elements, you have <code>Event.cancelBubble = true</code>.</p>
<p>So, the standard practice is for event managers to either create an Event object for you and set up the properties and methods based on the underlying <i>actual</i> Event object (as in Bean, jQuery and most others), or <i>fix</i> the Event object (as in Prototype). The performance trade-off here is that this is not cheap to do, especially for <i>every</i> event you need to react to. But there are ways to speed it up.</p>
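<p>In outline, the synthesised object wraps the native one and papers over the IE differences. A hedged sketch of the idea (the property names are illustrative; this is not Bean's implementation):</p>

```javascript
// Minimal sketch of synthesising an Event object over a native one:
// standard methods are provided, falling back to the old IE mechanisms
// (returnValue / cancelBubble) when the native methods are missing.
function synthesise(nativeEvent) {
  var event = {
    originalEvent: nativeEvent,
    preventDefault: function () {
      if (nativeEvent.preventDefault) nativeEvent.preventDefault();
      else nativeEvent.returnValue = false; // IE8 and below
    },
    stopPropagation: function () {
      if (nativeEvent.stopPropagation) nativeEvent.stopPropagation();
      else nativeEvent.cancelBubble = true; // IE8 and below
    }
  };
  // Copy only a whitelist of expected properties rather than everything
  // found on the native object (the names here are illustrative)
  ['type', 'target', 'keyCode'].forEach(function (key) {
    if (key in nativeEvent) event[key] = nativeEvent[key];
  });
  return event;
}
```

<p>The whitelist loop is where the v0.4 gains came from: only expected properties are read, avoiding the slow quirky ones.</p>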
<p>In Bean v0.4 we introduced a property "whitelist" which provided significant performance gains. In v0.3 and prior, Bean would try and copy every property and method that it found on the original Event object over to a new object (<code>{}</code>). It turns out that accessing some of these properties on some browsers comes with a significant performance penalty, and often you just don't need them because they are specific quirks of individual browsers. Since v0.4, Bean has been only looking at a list of properties that it expects to find on particular types of event objects and ignoring the rest. In the 0.5-wip branch, I started caching event "fixers" for each event type as they were encountered, so it's a little faster to figure out exactly what needs to be done as events are triggered.</p>
<p>But, it's still costly, so that's where the next performance trade-off comes in.</p>
<h4>Performance trade-off #3: hijacking event handler management</h4>
<p>Given that synthesising the Event object is so costly, and you end up doing it multiple times for a single event if you have more than one handler for that event, event managers have a trick up their sleeve to alleviate the pain. NWEvents, jQuery and others don't directly attach your event handler to the DOM; instead, they attach a single internal handler that is responsible for triggering any number of handlers you register for a given event on a particular element.</p>
<p>Consider the following code:</p>
<div class="highlight"><pre><span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o">&lt;</span> <span class="mi">100</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">$</span><span class="p">(</span><span class="s1">'#el'</span><span class="p">).</span><span class="nx">bind</span><span class="p">(</span><span class="s1">'click'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">()</span> <span class="p">{</span> <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">i</span><span class="p">)</span> <span class="p">})</span>
<span class="p">}</span>
</pre></div>
<p>This code works in both Bean and jQuery; the difference is that Bean v0.4 and earlier add 100 handlers directly to the DOM element to listen for that event, while jQuery adds just one and iterates over the others when the event is triggered. The new version of Bean does the same.</p>
<p>The reason this helps with performance is that we don't have to make a new Event object each time the event is triggered, we can reuse the same one across handlers.</p>
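<p>The single-DOM-handler idea can be sketched like this (an illustration of the technique, not Bean's implementation): one root function is attached to the DOM, and when it fires it synthesises a single event object and walks the list of user handlers with it:</p>

```javascript
// One root handler is registered with the DOM per element/event type;
// it synthesises a single event object and iterates the user handlers,
// so the (costly) synthesis happens once per event, not once per handler.
function createRootHandler(handlers) {
  return function rootHandler(nativeEvent) {
    // Synthesised once, then reused across every handler in the list
    var event = { type: nativeEvent.type, originalEvent: nativeEvent };
    for (var i = 0; i < handlers.length; i++) {
      handlers[i](event);
    }
  };
}
```

<p>Because the manager controls the loop, handler ordering is guaranteed regardless of what the browser would do natively.</p>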
<p>There's another major advantage to this approach, and perhaps a more important reason to implement an event manager this way: you get to hide some odd browser quirks. As <a href="https://twitter.com/kitcambridge">Kit Cambridge</a> <a href="https://twitter.com/kitcambridge/status/230775465542049792">noted recently</a>, older versions of Internet Explorer generally fire their handlers in LIFO order, yet the W3C spec for <code>addEventListener()</code> specifies FIFO order! In fact, it's even worse, because the <a href="https://twitter.com/kitcambridge/status/230775716239794176">Microsoft documentation</a> says that they may actually be triggered in <i>random order</i>! But if you only have a single real handler then you get complete control over the order.</p>
<p>The benefits go further though, we get to implement some nice features that are completely missing from older browsers and even some current browsers. The most notable is <code>Event.stopImmediatePropagation()</code>. This is a method that was introduced with DOM Level 3, so it's missing from IE8 and below, but surprisingly it's also missing from the current version of Opera! Perhaps the pressure is off because jQuery implements it as part of their relatively complete DOM Level 3 Events implementation using this single-DOM-handler method.</p>
<h5><code>stopImmediatePropagation()</code></h5>
<p>Bean has included a custom <code>Event.stop()</code> method since v0.4; it's modelled on the <a href="http://api.prototypejs.org/dom/Event/stop/">same method</a> in Prototype. It's also found in MooTools and some other libraries. This method combines both <code>Event.stopPropagation()</code> and <code>Event.preventDefault()</code> in a short and sweet little utility method. But <i>"stop"</i> is slightly misleading, because you can stop the default behaviour of the browser and you can stop the event bubbling up the DOM, but you <b>can't stop other event handlers for this event <i>at this element</i> from firing</b>. That's where the new <code>Event.stopImmediatePropagation()</code> comes in: it halts the processing of the event handler list for the current event at the current element (i.e. it can be used at any point in the bubbling process and it'll stop processing just the handlers at the element it was called at).</p>
<p>If an event manager takes the single-DOM-handler approach, it has to implement <code>stopImmediatePropagation()</code> itself because the native version no longer has an effect: the browser only sees a single handler. But you also get the benefit that it now works in any browser the event manager supports.</p>
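<p>In the single-DOM-handler model, <code>stopImmediatePropagation()</code> reduces to a flag that the dispatch loop checks. Roughly (again, a sketch of the technique, not Bean's code):</p>

```javascript
// stopImmediatePropagation() as a flag checked by the dispatch loop:
// once a handler sets it, the remaining handlers at this element are
// skipped, while bubbling to parent elements is unaffected.
function dispatch(handlers, event) {
  var stopped = false;
  event.stopImmediatePropagation = function () { stopped = true; };
  for (var i = 0; i < handlers.length && !stopped; i++) {
    handlers[i](event);
  }
}
```

<p>Because the loop lives in the library, this works even in browsers whose native Event object lacks the method entirely.</p>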
<p>At the time of writing this article I haven't decided whether I think that <code>Event.stop()</code> should also bundle <code>Event.stopImmediatePropagation()</code>. I'm leaning towards including it because <i>"stop"</i> should mean <b>stop</b> and the combination of all three methods would certainly do this.</p>
<h3>List of changes for Bean 1.0</h3>
<p><b><code>on()</code> argument ordering</b>: the new signature is now <code>.on(events[, selector], handlerFn)</code>, which will work on both Bean as a standalone library and when bundled in Ender. In Ender, the following aliases also pass through <code>on()</code> so the same arguments work: <code>addListener()</code>, <code>bind()</code>, <code>listen()</code> and <code>one()</code> (which of course will only trigger once). Plus all the specific shortcuts such as <code>click()</code>, <code>keyup()</code> etc., although these methods have the first argument hardwired.</p>
<p><code>add()</code> keeps its existing argument ordering in standalone Bean, and <code>delegate()</code> retains the same signature as jQuery's equivalent.</p>
<p><b><code>off()</code> is the new <code>remove()</code></b>: although <code>remove()</code> is still available in standalone Bean.</p>
<p><b>Bean attaches a single handler to the DOM for each event type on each element</b>: as outlined above, Bean will iterate over all handlers for each triggered event and (mostly) reuse the same Event object for each call.</p>
<p><b><code>Event.stopImmediatePropagation()</code>:</b> is available across all supported browsers, it will stop the processing of all handlers for the current event at the current element (i.e. the event will still bubble).</p>
<p><b>The selector engine argument to <code>add()</code> is now completely removed</b>: you used to have to pass a selector engine in as the last argument for delegated events. Now you must set it once at start-up with <code>setSelectorEngine()</code>. This is automatically taken care of for you in an Ender build.</p>
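<p>Mechanically, a once-only selector engine hook can look something like this (a hypothetical sketch of the pattern; the function names and the engine's <code>(selector, root) → elements</code> shape are assumptions, not Bean's internals):</p>

```javascript
// Hypothetical sketch: the selector engine is configured once at
// start-up and used for delegated-event matching, instead of being
// passed into every add() call.
var selectorEngine = null;

function setSelectorEngine(engine) {
  selectorEngine = engine;
}

// Does `element` match `selector` (under `root`) per the configured engine?
function matchesSelector(element, selector, root) {
  if (!selectorEngine) throw new Error('no selector engine set');
  return selectorEngine(selector, root).indexOf(element) !== -1;
}
```

<p>An engine like Qwery slots straight into this shape, which is presumably why an Ender build can wire it up automatically.</p>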
<p><b>A duplicate-handler check is no longer performed when you <i>add</i></b>: performance testing showed that this was a massive slow-down and is simply not something that Bean should be responsible for. If you want to add the same handler twice then that's your business and responsibility.</p>
<p><b>Namespace matching for event <code>fire()</code>ing now matches namespaces using an <i>and</i> instead of an <i>or</i>:</b> so, for example, firing namespaces 'a.b' will fire any event with <i>both</i> 'a' and 'b' rather than <i>either</i> 'a' or 'b'. This is compatible with jQuery and is arguably a much more sensible and helpful way to deal with namespaces. You can find some discussion on this <a href="https://github.com/fat/bean/pull/68">on GitHub</a>.</p>
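<p>The <i>and</i> semantics can be illustrated with a small matcher (hypothetical, not Bean's internals): a handler is triggered only if it carries <i>every</i> namespace named in the <code>fire()</code> call:</p>

```javascript
// 'and' namespace matching: every namespace named in the fire() call
// must be present on the handler for it to be triggered.
function namespaceMatch(firedNamespaces, handlerNamespaces) {
  return firedNamespaces.every(function (ns) {
    return handlerNamespaces.indexOf(ns) !== -1;
  });
}
```

<p>So firing 'a.b' matches a handler registered with namespaces 'a' and 'b' (and possibly more), but not one registered with only 'a'.</p>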
<p><b>Lots of internal improvements for speed, code size, etc.</b></p>
<h3>Deconstructing performance (benchmarks)</h3>
<p>We've had a benchmark suite since v0.4 to help measure the impact of changes, so I've extended it to help compare some versions of Bean. The benchmarks use <a href="http://benchmarkjs.com/">benchmark.js</a>.</p>
<p>There are 3 versions of Bean included here:</p>
<ul>
<li><b>Bean 0.4</b>: The current release of Bean, specifically version 0.4.11-1, source <a href="https://github.com/fat/bean/tree/3ded4e905ef89a344729953be69e438583968679">here</a>.</li>
<li><b>Bean 0.5a</b>: An unreleased version of Bean in the 0.5-wip branch. Specifically most of the changes listed above are included here <i>except</i> for the single-DOM-handler change. This is here to assess the impact of this change and deciding whether it's a worthwhile "improvement". Source <a href="https://github.com/fat/bean/tree/989f33520c1ef6cb07e138a4176b14f3184adef6">here</a>.</li>
<li><b>Bean 1.0a</b>: The main difference between this and 0.5a is the single-DOM-handler change. Source <a href="https://github.com/fat/bean/tree/4bcf05ffe12cfcf0bf8744d2ebc1627c554eed92">here</a>.</li>
</ul>
<p>I'll have some notes about my own analysis of these numbers below, but first I should mention that these benchmarks are not particularly helpful in showing how the libraries perform under real use patterns. I consider them mainly to be proxies for identifying the performance of particular behaviours within the libraries. You'll note that there are a lot of tests for <code>add()</code> / <code>on()</code>; that's simply because it's the easiest thing to test reliably, and also because I haven't bothered coming up with useful tests for other things. It's very difficult to test the actual event triggering, which would be the most interesting bit, although the <code>fire()</code> tests give us a little insight. The tests at the bottom try to capture a full add/fire/remove cycle, but even this isn't particularly helpful. These benchmarks can be found in the Bean repo so if you want to tinker then feel free, I'd love to have additional input.</p>
<p>So, more so than most benchmarks, take these with a very large grain of salt or two!</p>
<p><i>(The numbers are ops/sec, so higher is better in all cases)</i></p>
<style>
table.results { font-family: "Lucida Grande", "Lucida Sans Unicode", Verdana, sans-serif; margin-bottom: 1.4em; }
table.results th:first-child, table.results td { text-align: right; }
table.results tbody tr:nth-child(2n+1) { background-color: rgba(0,0,0,0.075); }
table.results tr > * { padding: 0 3px; }
</style>
<h4>Chrome</h4>
<table class="results">
<thead>
<tr><th></th>
<th>Bean 0.4</th>
<th>Bean 0.5a</th>
<th>Bean 1.0a</th>
<th>NWEvents</th>
<th>jQuery</th>
</tr>
</thead>
<tbody>
<tr><th>add(element, event, fn)</th><td>25,760</td><td>66,580</td><td>185,147</td><td>18,133</td><td>142,161</td></tr>
<tr><th>add(unique element, event, fn)</th><td>33,024</td><td>99,208</td><td>36,481</td><td>18,634</td><td>50,554</td></tr>
<tr><th>add(element, custom, fn)</th><td>28,728</td><td>56,607</td><td>165,189</td><td>11,248</td><td>119,593</td></tr>
<tr><th>add(unique element, custom, fn)</th><td>36,150</td><td>78,260</td><td>34,308</td><td>24,409</td><td>44,761</td></tr>
<tr><th>add(element, event.namespace, fn)</th><td>30,082</td><td>64,435</td><td>189,468</td><td></td><td>136,486</td></tr>
<tr><th>add(unique element, event.namespace, fn)</th><td>33,702</td><td>101,915</td><td>34,678</td><td></td><td>33,637</td></tr>
<tr><th>add(element, selector, event, fn)</th><td>25,180</td><td>42,274</td><td>119,339</td><td>2,909</td><td>76,171</td></tr>
<tr><th>add(unique element, selector, event, fn)</th><td>27,328</td><td>91,156</td><td>30,308</td><td>1,069</td><td>35,696</td></tr>
<tr><th>add({})</th><td>15,594</td><td>27,312</td><td>59,434</td><td></td><td></td></tr>
<tr><th>fire(event)</th><td>576</td><td>492</td><td>6,860</td><td>9,797</td><td>21,821</td></tr>
<tr><th>fire(custom)</th><td>165,222</td><td>164,418</td><td>161,243</td><td>240,961</td><td>86,291</td></tr>
<tr><th>fire(namespace)</th><td>29,742</td><td>28,721</td><td>27,666</td><td></td><td></td></tr>
<tr><th>element add / click / remove</th><td>18,579</td><td>17,425</td><td>14,760</td><td>1,748</td><td>2,775</td></tr>
<tr><th>element add / fire / remove</th><td>31,230</td><td>28,344</td><td>15,802</td><td>1,127</td><td>2,763</td></tr>
<tr><th>object add / fire / remove</th><td>58,927</td><td>53,139</td><td>49,549</td><td>107,700</td><td>18,619</td></tr>
</tbody>
</table>
<h4>Firefox</h4>
<table class="results">
<thead>
<tr><th></th>
<th>Bean 0.4</th>
<th>Bean 0.5a</th>
<th>Bean 1.0a</th>
<th>NWEvents</th>
<th>jQuery</th>
</tr></thead>
<tbody>
<tr><th>add(element, event, fn)</th><td>20,404</td><td>45,030</td><td>100,546</td><td>13,826</td><td>63,159</td>
</tr><tr><th>add(unique element, event, fn)</th><td>16,708</td><td>67,417</td><td>19,625</td><td>16,810</td><td>29,130</td>
</tr><tr><th>add(element, custom, fn)</th><td>16,691</td><td>42,601</td><td>134,535</td><td>13,368</td><td>59,774</td>
</tr><tr><th>add(unique element, custom, fn)</th><td>24,159</td><td>55,312</td><td>21,235</td><td>13,475</td><td>27,877</td>
</tr><tr><th>add(element, event.namespace, fn)</th><td>17,414</td><td>53,639</td><td>101,427</td><td></td><td>55,321</td>
</tr><tr><th>add(unique element, event.namespace, fn)</th><td>23,735</td><td>59,751</td><td>22,034</td><td></td><td>27,576</td>
</tr><tr><th>add(element, selector, event, fn)</th><td>18,766</td><td>54,571</td><td>92,602</td><td>2,317</td><td>36,753</td>
</tr><tr><th>add(unique element, selector, event, fn)</th><td>22,094</td><td>56,026</td><td>16,705</td><td>964</td><td>22,102</td>
</tr><tr><th>add({})</th><td>9,126</td><td>17,104</td><td>32,093</td><td></td><td></td>
</tr><tr><th>fire(event)</th><td>260</td><td>266</td><td>3,391</td><td>3,120</td><td>11,154</td>
</tr><tr><th>fire(custom)</th><td>61,845</td><td>59,950</td><td>61,742</td><td>93,033</td><td>45,978</td>
</tr><tr><th>fire(namespace)</th><td>28,910</td><td>27,379</td><td>23,127</td><td></td><td></td>
</tr><tr><th>element add / click / remove</th><td>7,644</td><td>6,220</td><td>6,005</td><td>1,284</td><td>4,845</td>
</tr><tr><th>element add / fire / remove</th><td>11,288</td><td>10,954</td><td>7,458</td><td>788</td><td>9,115</td>
</tr><tr><th>object add / fire / remove</th><td>45,165</td><td>37,934</td><td>37,306</td><td>38,097</td><td>12,490</td>
</tr></tbody>
</table>
<h4>IE9</h4>
<table class="results">
<thead>
<tr><th></th>
<th>Bean 0.4</th>
<th>Bean 0.5a</th>
<th>Bean 1.0a</th>
<th>NWEvents</th>
<th>jQuery</th>
</tr></thead>
<tbody>
<tr><th>add(element, event, fn)</th><td>925</td><td>944</td><td>209,714</td><td>4,321</td><td>117,343</td>
</tr><tr><th>add(unique element, event, fn)</th><td>13,559</td><td>113,944</td><td>10,568</td><td>3,012</td><td>58,929</td>
</tr><tr><th>add(element, custom, fn)</th><td>946</td><td>1,004</td><td>219,631</td><td>4,329</td><td>128,570</td>
</tr><tr><th>add(unique element, custom, fn)</th><td>7,557</td><td>123,288</td><td>12,620</td><td>3,191</td><td>32,610</td>
</tr><tr><th>add(element, event.namespace, fn)</th><td>880</td><td>826</td><td>87,932</td><td></td><td>53,737</td>
</tr><tr><th>add(unique element, event.namespace, fn)</th><td>11,823</td><td>103,977</td><td>12,001</td><td></td><td>28,053</td>
</tr><tr><th>add(element, selector, event, fn)</th><td>655</td><td>802</td><td>57,619</td><td>382</td><td>21,159</td>
</tr><tr><th>add(unique element, selector, event, fn)</th><td>11,649</td><td>96,597</td><td>11,404</td><td>139</td><td>24,756</td>
</tr><tr><th>add({})</th><td>53</td><td>49</td><td>17,735</td><td></td><td></td>
</tr><tr><th>fire(event)</th><td>290,543</td><td>286,385</td><td>293,547</td><td>71,396</td><td>22,794</td>
</tr><tr><th>fire(custom)</th><td>229,241</td><td>223,189</td><td>216,943</td><td>78,395</td><td>23,081</td>
</tr><tr><th>fire(namespace)</th><td>17,507</td><td>11,848</td><td>16,018</td><td></td><td></td>
</tr><tr><th>element add / click / remove</th><td>10,228</td><td>9,697</td><td>9,260</td><td>478</td><td>8,345</td>
</tr><tr><th>element add / fire / remove</th><td>13,062</td><td>10,587</td><td>18,577</td><td>155</td><td>6,094</td>
</tr><tr><th>object add / fire / remove</th><td>30,924</td><td>29,096</td><td>28,904</td><td>39,761</td><td>7,634</td>
</tr></tbody>
</table>
<p>First, let me say that the IE results don't make a whole lot of sense so I'm going to suggest that the Chrome and Firefox benchmarks are the best indicators of general performance characteristics across browsers. The IE results have similar patterns to the others but there's way too much strangeness in there for me to take them seriously! IE8 has difficulty running all the benchmarks without locking up and I don't care enough to persevere there so I'm ignoring that too. Safari crashes and Opera has very similar results to Firefox and Chrome.</p>
<p><i>(Just to clarify, it's only the benchmarks that have trouble running in older versions of IE, the Bean test suite still runs on IE6 and above and has been beefed up even more in the 0.5-wip branch.)</i></p>
<h4>Some observations</h4>
<ul>
<li>The gains for <code>add()</code> from Bean v0.4 to v0.5a are largely from removing the <b>duplicate handler check</b>.</li>
<li>The reason for the duplicate tests for <b><i>"element"</i> vs <i>"unique element"</i> in the <code>add()</code> benchmarks</b> is to demonstrate the costs and benefits involved in the single-DOM-handler model. You can see that the numbers switch between the non-unique / unique tests for Bean v0.5a and v1.0a. Also, jQuery suffers significantly when you feed it unique elements because it has to add DOM handlers each time.</li>
<li>The poor performance for Bean v0.4 and v0.5a in <b><code>fire()</code> benchmarks</b> is mostly attributable to <b>Event object synthesising</b>, rather than the speed of the browser-native handler list management. This is important because while firing native-style events (e.g. <code>fire('click')</code>, which is what we're testing here) is not a common activity, we're having to synthesise the event object each time a handler is triggered. So, this is where Bean finds the most <i>win</i> in switching to a single-DOM-handler model.</li>
<li>Bean loses performance between v0.5a and v1.0a in the unique element <code>add()</code> benchmarks, this can mostly be explained by the <b>overhead of managing the root handler that it needs to attach to the DOM</b>. The handler is stored in the internal registry and each time you <code>add()</code> it needs to work out if you already have a root handler attached to the DOM or not for the given event / element. jQuery gets to take some shortcuts by polluting the DOM and handler functions with <code>guid</code> properties. However, the numbers suggest to me that there is some additional performance that could be squeezed out of Bean in this area.</li>
<li>Bean is fairly liberal with its <b>whitelist of properties</b> to copy from the original Event object, jQuery is a bit more restrictive with its similar system, this may slow Bean down very slightly.</li>
<li><b>Delegated events</b> are not represented well here, but the results would be very interesting because of the additional work required.</li>
</ul>
<h3>File size</h3>
<p>A lot of users of Bean are file-size-sensitive, so it's important to highlight that there are costs to these performance improvements. Minified, gzipped, the sizes for each of these versions of Bean are:</p>
<table class="results">
<tbody>
<tr><th>Bean 0.4</th><td>3870 bytes</td></tr>
<tr><th>Bean 0.5a</th><td>3959 bytes</td></tr>
<tr><th>Bean 1.0a</th><td>4176 bytes</td></tr>
</tbody>
</table>
<p>I've tried <i>really</i> hard to keep the size under 4kb but the additional overhead in managing the single-DOM-handler is too much to achieve that, even though I've managed to shave many precious bytes off in other areas of the code in the process (which unfortunately can't be seen in these numbers!).</p>
<p>We're still well under the minified, gzipped size of the jQuery events module by itself, even though we implement very similar functionality and jQuery gets to leverage lots of internal sugar not contained within the events module.</p>
<h3>Request for feedback</h3>
<p>After all that, what I really want is feedback! At this point I'm happy to release a proper version 1.0, I think it's major enough to warrant a jump past 0.5. I'd really like to hear feedback from people that have doubts that the changes are worth it, particularly the single-DOM-handler change.</p>
<h3>Using the 1.0 pre-release</h3>
<p>I've started using it in production and am very happy with the results so far, I'd love to have feedback from anyone else who wants to give it a spin.</p>
<p>The new version of Bean is in npm with the tag <i>dev</i> so you can include it in your Ender builds by referring to <i>bean@dev</i> as the package name.</p>
<p>For stand-alone, you can grab it from the <a href="https://github.com/fat/bean/tree/0.5-wip">0.5-wip branch</a> on GitHub.</p>
<p>Thanks for getting this far!</p>
mod_geoip2_xff update2012-07-06T02:47:17.000Zhttps://r.va.gg/2012/07/mod_geoip2_xff-update.html
<p>Thanks to a contribution from <a href="https://plus.google.com/105599514712357912650/posts">Kevin Gaudin</a>, I have a new release of my <a href="http://www.maxmind.com/app/mod_geoip">mod_geoip2</a> fork. (The history starts <a href="http://rod.vagg.org/2012/04/a-mod_geoip2-that-properly-handles-x-forwarded-for/">here</a>.)</p>
<p>You can find the source here: <a href="https://github.com/rvagg/mod_geoip2_xff">https://github.com/rvagg/mod_geoip2_xff</a></p>
<p>Kevin's addition provides a fall-back to the standard remote IP address of the client if no public IP address is found in the <em>X-Forwarded-For</em> header. Previously, my implementation fell back to the default mod_geoip2 behaviour of taking the first IP address in the <em>X-Forwarded-For</em> header, or the last if you set <em>GeoIPUseLastXForwardedForIP</em> in your config.</p>
<p>I also took the opportunity to clean things up a little and introduce a config option to turn on the special <em>X-Forwarded-For</em> handling. You now have to set <strong>GeoIPUseLeftPublicXForwardedForIP</strong> to <strong>On</strong> to activate it.</p>
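<p>For reference, opting in might look something like this in an Apache config (a sketch: the surrounding directives are the stock mod_geoip2 ones and the database path is illustrative; check the repo README for the authoritative list):</p>

```apache
<IfModule mod_geoip.c>
  GeoIPEnable On
  GeoIPDBFile /usr/share/GeoIP/GeoIP.dat
  # Opt in to the left-most-public X-Forwarded-For handling
  GeoIPUseLeftPublicXForwardedForIP On
</IfModule>
```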
<p>Thanks to Kevin, and additional contributions are welcome!</p>
<p><strong>Update July 7th 2012</strong>: Since I was in C-mode, I went ahead and implemented something I've tried to get working in the past: <strong>hostname lookups on the X-Forwarded-For host!</strong> I got intimate with APR and worked out how to use Apache to do the resolution so there isn't the lengthy timeout of raw syscalls. If you set <strong>GeoIPEnableHostnameLookups</strong> to <strong>On</strong>, you'll get a <strong>GEOIP_HOST</strong> environment variable to use.</p>
<p>I've also decided to start making tarballs available off GitHub for your convenience: <a href="https://github.com/rvagg/mod_geoip2_xff/downloads">https://github.com/rvagg/mod_geoip2_xff/downloads</a></p>
Data URI + SVG2012-05-23T05:16:22.000Zhttps://r.va.gg/2012/05/data-uri-svg.html
<p>Data URIs are great when you want to serve small resources that there's no point serving up in a combined sprite. Consider <a href="http://microjs.com">microjs.com</a> which serves up an HTML file plus a single JavaScript file containing the latest data used to build the site. The build logic is in an embedded script, the CSS is also embedded, so it's pretty lean considering what you see and the amount of data displayed. But notice the 3 icons for each project: 2 GitHub icons and a Twitter icon. They are PNG images, combined as a sprite, but to avoid an additional HTTP request to fetch them they are simply embedded in the CSS, which is itself embedded in the page:</p>
<div class="highlight"><pre><span class="nc">.title</span> <span class="nc">.stat</span> <span class="nt">span</span> <span class="p">{</span>
<span class="k">background-image</span><span class="o">:</span> <span class="k">url</span><span class="p">(</span><span class="err">"</span><span class="n">data</span><span class="o">:</span><span class="n">image</span><span class="o">/</span><span class="n">png</span><span class="p">;</span><span class="n">base64</span><span class="o">,</span><span class="n">iVBORw0KGgoAAAANSUhE</span><span class="o">...</span>
<span class="p">}</span>
</pre></div>
<p>Easy and quick and fairly well supported across browsers.</p>
<p>But Data URIs can do so much more, including embed SVG!</p>
<div class="highlight"><pre>url("data:image/svg+xml,<span class="nt">&lt;svg</span> <span class="na">viewBox=</span><span class="s">'0 0 40 40'</span> <span class="na">height=</span><span class="s">'25'</span> <span class="na">width=</span><span class="s">'25'</span>
<span class="na">xmlns=</span><span class="s">'http://www.w3.org/2000/svg'</span><span class="nt">&gt;&lt;path</span> <span class="na">fill=</span><span class="s">'rgb(91, 183, 91)'</span> <span class="na">d=</span><span class="s">'M2.379,</span>
<span class="s">14.729L5.208,11.899L12.958,19.648L25.877,6.733L28.707,9.561L12.958,25.308Z'</span>
<span class="nt">/&gt;&lt;/svg&gt;</span>")
</pre></div>
<p>The above will produce a 25px square image but the SVG is drawn in a 40x40 coordinate box, because I'm using <a href="http://raphaeljs.com/icons/">Raphaël Icon</a> paths (you can try it yourself by replacing the <code>d=''</code> content with the path data you get when you click on any of the icons on the <a href="http://raphaeljs.com/icons/">Raphaël Icons</a> page).</p>
<p>SVG of course gives you perfectly scalable graphics, embedding in a Data URI in your CSS lets you use them in the same way that you use other CSS images, minus the need to fetch them via an additional HTTP request.</p>
<p><strong>What's the catch?</strong></p>
<p>It's the web, of course there's a catch, and of course it involves Internet Explorer!</p>
<p>For a start you don't get SVG support in IE8 and below, which is a bit of a problem right now because IE8 is still very much with us, largely because IE9 isn't available for Windows XP users. But there's more to it than that. IE adheres to the <a href="http://www.ietf.org/rfc/rfc2397.txt">spec</a> more strictly than other browsers. There are 2 types of encoding for Data URIs, <em>base64</em> and <em>non-base64</em>. If you leave the <code>;base64</code> off your string then most browsers let you get away with anything that doesn't conflict with standard CSS, so basically don't use <code>"</code>, or if you do, escape them as <code>\"</code>. What the Data URI spec says is:</p>
<p><blockquote>...the data (as a sequence of octets) is represented using ASCII encoding for octets inside the range of safe URL characters and using the standard %xx hex encoding of URLs for octets outside that range.</blockquote>
And IE doesn't let you have it any other way. So you either encode your SVG into Base64 or escape it with <code>%xx</code>'s, which kind of loses some of the elegance of SVG in CSS. But at least you'll get IE9+ support.</p>
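<p>To make those two forms concrete, here's a minimal sketch, in Java, of generating both a Base64 and a <code>%xx</code>-escaped Data URI from raw SVG markup at build time. The <code>SvgDataUri</code> class name and the character whitelist are my own choices: the whitelist keeps only unreserved URL characters, which is stricter than the spec demands but safe everywhere, IE included.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: builds the two Data URI forms discussed above.
public class SvgDataUri {

    // Base64 form: works in IE9+, but inflates the SVG by roughly a third.
    public static String base64(String svg) {
        return "data:image/svg+xml;base64,"
                + Base64.getEncoder().encodeToString(svg.getBytes(StandardCharsets.UTF_8));
    }

    // %xx form: keeps the markup mostly readable. We escape every octet
    // outside the unreserved set (ASCII letters, digits, -_.~), which is
    // more conservative than the spec requires but keeps IE happy.
    public static String escaped(String svg) {
        StringBuilder sb = new StringBuilder("data:image/svg+xml,");
        for (byte b : svg.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xff);
            if ((c < 128 && Character.isLetterOrDigit(c)) || "-_.~".indexOf(c) >= 0) {
                sb.append(c);
            } else {
                sb.append(String.format("%%%02X", b & 0xff));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns='http://www.w3.org/2000/svg' width='25' height='25'/>";
        System.out.println(base64(svg));
        System.out.println(escaped(svg));
    }
}
```

<p>Either output can be dropped straight into a <code>url("…")</code> in your CSS; the Base64 encoder also emits the trailing <code>=</code> padding for you.</p>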
<p>So here are some examples to <a href="http://jsfiddle.net/rvagg/exULa/">fiddle</a> with. Click through to the CSS tab to see the gory details. The first icon is Base64 encoded, the second icon is URL escaped (<code>%xx</code>), and the rest are plain SVG, so you'll get different results viewing in IE9 vs the rest.</p>
<p>SVG in Data URIs is an elegant solution (and a bit of fun) but only really useful at the moment if you don't need to support IE8 and below.</p>
<p><strong>Update 17th Sept 2012</strong></p>
<p>Below in the comments, Ben reports on his (much more rigorous) research into browser support; refer to that if you're serious about using SVG in Data URIs. An interesting result of his work comes from the <a href="https://code.google.com/p/chromium/issues/detail?id=137277">issue</a> he filed with Chromium (I don't know if this is a generic WebKit thing or not, but you could easily test it if you're interested). It turns out that Chromium/WebKit requires Base64 Data URIs to be multiples of 4 characters, so you just need to pad with <code>=</code> characters.</p>
A mod_geoip2 that properly handles X-Forwarded-For2012-04-22T04:55:42.000Zhttps://r.va.gg/2012/04/a-mod_geoip2-that-properly-handles-x-forwarded-for.html
<p>This is just a short follow-up to my original post on <em><a title="Wrangling the X-Forwarded-For Header" href="http://rod.vagg.org/2011/07/wrangling-the-x-forwarded-for-header/">Wrangling the X-Forwarded-For Header</a></em>, where I promised that one of the things I would follow up with was how to get MaxMind's mod_geoip2 to handle X-Forwarded-For according to the rule:</p>
<p><p style="text-align: center;"><strong><em>Always use the leftmost non-private address</em></strong>.</p>
Well, since it's turning out to be such a popular post I thought I'd better get it done to help anyone else who's searching around for solutions. So, I've put up the code on my GitHub account here:</p>
<p><p style="text-align: center;"><strong><a href="https://github.com/rvagg/mod_geoip2_xff">https://github.com/rvagg/mod_geoip2_xff</a></strong></p>
I'm maintaining a <em>maxmind</em> branch that contains the original code from MaxMind and the <em>master</em> contains my changes, so you can see a nice <a href="https://github.com/rvagg/mod_geoip2_xff/compare/maxmind...master">diff</a> of what I've done.</p>
<p>I have to warn that I haven't done any serious C programming for more than 15 years or so, my code probably isn't fantastic, and I'm open to outside contributions from anyone with suggestions. The approach I've taken is to embed the regexes of my previous post into the module and walk through the IP addresses looking for a non-private match.</p>
<p>Since my initial release, based on MaxMind's 1.2.5, they've put out a 1.2.7 which includes the addition of a <em>GeoIPUseLastXForwardedForIP</em> flag. I can imagine what prompted this addition, but as I said in my previous post this isn't the way to get the best IP address. As of writing, my current master branch is based on 1.2.7 and has this new flag, but because the <em>first_public_ip_in_list</em> check is done first the flag is mostly useless.</p>
<p>If anyone wants to hassle MaxMind on my behalf then feel free, I sent them an email a couple of months ago about this but received no answer.</p>
<p><strong>Update 6-July-2012</strong>: A new release with some changes, details <a href="http://rod.vagg.org/2012/07/mod_geoip2_xff-update/">here</a>.</p>
JavaScript and Semicolons2012-04-20T06:10:16.000Zhttps://r.va.gg/2012/04/javascript-and-semicolons.html
<p>In syntax terms, JavaScript is in the broad C-family of languages. The C-family is diverse and includes languages such as C (obviously), C++, Objective-C, Perl, Java, C# and the newer Go from Google and Rust from Mozilla. Common themes in these languages include:</p>
<p><ul>
<li>The use of curly braces to surround blocks.</li>
<li>The general insignificance of white space (spaces, tabs, new lines) except in very limited cases. Indentation is optional and is therefore a matter of style and preference, plus programs can be written on as few or as many lines as you want.</li>
<li>The use of semicolons to end statements, expressions and other constructs. Semicolons become the delimiter that the new line character is in white-space-significant languages.</li>
</ul>
JavaScript’s rules for curly braces, white space and semicolons are consistent with the C-family, and its formal specification, known as the ECMAScript Language Specification, makes this clear:</p>
<p><blockquote>Certain ECMAScript statements (empty statement, variable statement, expression statement, do-while statement, continue statement, break statement, return statement, and throw statement) must be terminated with semicolons.</blockquote>
But it doesn’t end there: JavaScript introduces what’s known as <strong>Automatic Semicolon Insertion (ASI)</strong>. The specification continues:</p>
<p><blockquote>Such semicolons may always appear explicitly in the source text. For convenience, however, such semicolons may be omitted from the source text in certain situations. These situations are described by saying that semicolons are automatically inserted into the source code token stream in those situations.</blockquote>
The general C-family rules for semicolons can be found in most teaching material for JavaScript and have been advocated by most of the prominent JavaScript personalities since 1995. In a <a href="https://brendaneich.com/2012/04/the-infernal-semicolon/">recent post</a>, JavaScript’s inventor, Brendan Eich, described ASI as “a syntactic error correction procedure” (as in “<a href="https://brendaneich.com/2012/04/the-infernal-semicolon/#comment-12268">parsing error</a>”, rather than “user error”).</p>
<p><strong><em>The rest of this article about semicolons in JavaScript can be found on <a title="JavaScript and Semicolons" href="http://dailyjs.com/2012/04/19/semicolons/">DailyJS</a>.</em></strong></p>
Minifying HTML in the Servlet container2011-11-23T04:38:44.000Zhttps://r.va.gg/2011/11/minifying-html-in-the-servlet-container.html
<p>Google's <a title="mod_pagespeed" href="http://code.google.com/speed/page-speed/docs/module.html">mod_pagespeed</a> is great. I've been using it for a while now on <a title="FeedXL Horse Nutrition" href="http://feedxl.com">feedxl.com</a> but the only filter that I actually find really useful is <a href="http://code.google.com/speed/page-speed/docs/filter-whitespace-collapse.html">Collapse Whitespace</a>; the rest of the filters I either already do myself as part of the site build process or I don't want applied. But, I imagine that there are a lot of admins out there that would really benefit from all of the clever things it can do.</p>
<p>Unfortunately it's just an Apache2 module so it's a bit difficult to use the cleverness elsewhere. I recently launched a new service that serves content directly from Apache Tomcat without passing through an Apache2 web server like I would normally do (because there was just no need!). Having got used to the nice whitespace optimisations you can get from mod_pagespeed I decided to implement a simple version of my own for Tomcat. With dynamic content you're better off not trying to optimise whitespace during generation; leave it for post-processing so your logic can stay clear.</p>
<p>So, enter <strong>HTMLMinifyFilter</strong>. It's nowhere near as clever as mod_pagespeed but it'll do for basic needs. The core of it is a regular expression that will remove certain patterns and it's configurable so you decide which patterns to include.</p>
<div class="highlight"><pre><span class="kn">package</span> <span class="n">au</span><span class="o">.</span><span class="na">com</span><span class="o">.</span><span class="na">xprime</span><span class="o">.</span><span class="na">misc</span><span class="o">.</span><span class="na">webapp</span><span class="o">.</span><span class="na">filter</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.regex.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">javax.servlet.*</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">HTMLMinifyFilter</span> <span class="kd">implements</span> <span class="n">Filter</span> <span class="o">{</span>
<span class="kd">private</span> <span class="n">Pattern</span> <span class="n">regex</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">doFilter</span><span class="o">(</span><span class="n">ServletRequest</span> <span class="n">req</span><span class="o">,</span> <span class="n">ServletResponse</span> <span class="n">res</span><span class="o">,</span> <span class="n">FilterChain</span> <span class="n">chain</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">IOException</span><span class="o">,</span> <span class="n">ServletException</span> <span class="o">{</span>
<span class="n">HttpServletResponse</span> <span class="n">response</span> <span class="o">=</span> <span class="o">(</span><span class="n">HttpServletResponse</span><span class="o">)</span> <span class="n">res</span><span class="o">;</span>
<span class="n">ResponseWrapper</span> <span class="n">wrapper</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">ResponseWrapper</span><span class="o">(</span><span class="n">response</span><span class="o">);</span>
<span class="n">chain</span><span class="o">.</span><span class="na">doFilter</span><span class="o">(</span><span class="n">req</span><span class="o">,</span> <span class="n">wrapper</span><span class="o">);</span>
<span class="n">String</span> <span class="n">html</span> <span class="o">=</span> <span class="n">wrapper</span><span class="o">.</span><span class="na">toString</span><span class="o">();</span>
<span class="k">if</span> <span class="o">(</span><span class="n">regex</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">&&</span> <span class="n">response</span><span class="o">.</span><span class="na">getContentType</span><span class="o">()</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">&&</span> <span class="n">response</span><span class="o">.</span><span class="na">getContentType</span><span class="o">().</span><span class="na">startsWith</span><span class="o">(</span><span class="s">"text/html"</span><span class="o">))</span>
<span class="n">html</span> <span class="o">=</span> <span class="n">regex</span><span class="o">.</span><span class="na">matcher</span><span class="o">(</span><span class="n">html</span><span class="o">).</span><span class="na">replaceAll</span><span class="o">(</span><span class="s">""</span><span class="o">);</span>
<span class="n">response</span><span class="o">.</span><span class="na">setContentLength</span><span class="o">(</span><span class="n">html</span><span class="o">.</span><span class="na">getBytes</span><span class="o">().</span><span class="na">length</span><span class="o">);</span>
<span class="n">PrintWriter</span> <span class="n">out</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="na">getWriter</span><span class="o">();</span>
<span class="n">out</span><span class="o">.</span><span class="na">write</span><span class="o">(</span><span class="n">html</span><span class="o">);</span>
<span class="n">out</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">destroy</span><span class="o">()</span> <span class="o">{</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">init</span><span class="o">(</span><span class="n">FilterConfig</span> <span class="n">config</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">ServletException</span> <span class="o">{</span>
<span class="n">StringBuffer</span> <span class="n">pattern</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">StringBuffer</span><span class="o">();</span>
<span class="n">appendIf</span><span class="o">(</span><span class="n">config</span><span class="o">,</span> <span class="s">"strip-linestart-whitespace"</span><span class="o">,</span> <span class="n">pattern</span><span class="o">,</span> <span class="s">"(?<=^)[ \\t]+"</span><span class="o">);</span>
<span class="n">appendIf</span><span class="o">(</span><span class="n">config</span><span class="o">,</span> <span class="s">"strip-lineend-whitespace"</span><span class="o">,</span> <span class="n">pattern</span><span class="o">,</span> <span class="s">"[ \\t]+(?:$)"</span><span class="o">);</span>
<span class="n">appendIf</span><span class="o">(</span><span class="n">config</span><span class="o">,</span> <span class="s">"strip-multiple-whitespace"</span><span class="o">,</span> <span class="n">pattern</span><span class="o">,</span> <span class="s">"[ \\t](?=[ \\t])"</span><span class="o">);</span>
<span class="n">appendIf</span><span class="o">(</span><span class="n">config</span><span class="o">,</span> <span class="s">"strip-blank-lines"</span><span class="o">,</span> <span class="n">pattern</span><span class="o">,</span> <span class="s">"(\\n[ \\t]*)+(?=\\n)"</span><span class="o">);</span>
<span class="k">if</span> <span class="o">(</span><span class="n">pattern</span><span class="o">.</span><span class="na">length</span><span class="o">()</span> <span class="o">!=</span> <span class="mi">0</span><span class="o">)</span>
<span class="n">regex</span> <span class="o">=</span> <span class="n">Pattern</span><span class="o">.</span><span class="na">compile</span><span class="o">(</span><span class="n">pattern</span><span class="o">.</span><span class="na">toString</span><span class="o">(),</span> <span class="n">Pattern</span><span class="o">.</span><span class="na">MULTILINE</span><span class="o">);</span>
<span class="o">}</span>
<span class="kd">private</span> <span class="kt">void</span> <span class="nf">appendIf</span><span class="o">(</span><span class="n">FilterConfig</span> <span class="n">config</span><span class="o">,</span> <span class="n">String</span> <span class="n">configKey</span><span class="o">,</span> <span class="n">StringBuffer</span> <span class="n">pattern</span><span class="o">,</span> <span class="n">String</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">config</span><span class="o">.</span><span class="na">getInitParameter</span><span class="o">(</span><span class="n">configKey</span><span class="o">)</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">&&</span> <span class="n">config</span><span class="o">.</span><span class="na">getInitParameter</span><span class="o">(</span><span class="n">configKey</span><span class="o">).</span><span class="na">equals</span><span class="o">(</span><span class="s">"true"</span><span class="o">))</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">pattern</span><span class="o">.</span><span class="na">length</span><span class="o">()</span> <span class="o">!=</span> <span class="mi">0</span><span class="o">)</span>
<span class="n">pattern</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="sc">'|'</span><span class="o">);</span>
<span class="n">pattern</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">s</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">static</span> <span class="kd">class</span> <span class="nc">ResponseWrapper</span> <span class="kd">extends</span> <span class="n">HttpServletResponseWrapper</span> <span class="o">{</span>
<span class="kd">private</span> <span class="n">CharArrayWriter</span> <span class="n">output</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">ResponseWrapper</span><span class="o">(</span><span class="n">HttpServletResponse</span> <span class="n">response</span><span class="o">)</span> <span class="o">{</span>
<span class="kd">super</span><span class="o">(</span><span class="n">response</span><span class="o">);</span>
<span class="k">this</span><span class="o">.</span><span class="na">output</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">CharArrayWriter</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="n">String</span> <span class="nf">toString</span><span class="o">()</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">output</span><span class="o">.</span><span class="na">toString</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="n">PrintWriter</span> <span class="nf">getWriter</span><span class="o">()</span> <span class="o">{</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">PrintWriter</span><span class="o">(</span><span class="n">output</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</pre></div>
<p><h3>How does it work?</h3>
We start off by wrapping our response in an object that will supply a CharArrayWriter so we can capture and process whatever the rest of the stack is doing (credit for this idea goes <a href="http://stackoverflow.com/questions/5009650/where-can-i-find-a-java-servlet-filter-that-applies-regex-to-the-output">here</a>). We can then process the output with our regular expression(s) and pass it to the real response.</p>
<p>Before I explain what the regular expressions do I want to caution that this won't be satisfactory in certain situations. It's not aware of <code>&lt;script&gt;</code>, <code>&lt;pre&gt;</code> or any other content where whitespace may be important, so unless you're sure stripping whitespace doesn't matter you may want to find a more intelligent solution.</p>
<p>I've split the regex up into 4 optional parts which you turn on with init-parameters (explained later); matches of each of these are replaced with an empty string:</p>
<p><h4>strip-linestart-whitespace - (?&lt;=^)[ \t]+</h4>
This regex will match whitespace at the beginning of any line. You'll notice that I'm not using \s for my whitespace match; this is because with multi-line pattern matching it'll also match \n and \r, which we want to handle separately. The (?&lt;=^) at the beginning is a zero-width positive look-behind for <em>line-start</em>; it anchors the match at the start of the line without consuming anything, so we only strip out the whitespace.</p>
<p>This option is likely to make the biggest impact on HTML minification on dynamic content because we love to use indentation to define structure.</p>
<p><h4>strip-lineend-whitespace - [ \t]+(?:$)</h4>
Same deal as the linestart regex, but this time we have (?:$), a non-capturing group wrapping the <em>line end</em> anchor; $ is zero-width so nothing extra is consumed.</p>
<p>This will pick up any sloppiness in your HTML (I wish I could do this in Microsoft Word when I have to edit other people's documents; you can't see it, <strong>but it's still there</strong>!).</p>
<p><h4>strip-multiple-whitespace - [ \t](?=[ \t])</h4>
Here we match any whitespace character that has another whitespace character immediately after it, using a look-ahead so the second character isn't consumed. Remember that we are replacing matches with an empty string, so every character in a run of whitespace except the last is stripped, leaving a single space intact.</p>
<p>This is probably going to be the most dangerous if you have content where whitespace is important, e.g. <code>&lt;script&gt;</code>, <code>&lt;pre&gt;</code>.</p>
<p><h4>strip-blank-lines - (\n[ \t]*)+(?=\n)</h4>
This is similar in spirit: we match one or more newlines, each optionally followed by trailing whitespace, with a look-ahead for a final newline that isn't consumed. So a run of blank lines collapses down to a single newline, and we get rid of any lines that don't contain content.</p>
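<p>To see the four rules working together, here's a small standalone sketch of the combined expression, built the same way <code>init()</code> ORs the parts together (using look-aheads for the collapse rules so a matched run always leaves one character behind); the class name and sample input are mine, not part of the filter:</p>

```java
import java.util.regex.Pattern;

// Standalone sketch of the four whitespace rules combined into one alternation.
public class MinifyDemo {

    static final Pattern ALL = Pattern.compile(
            "(?<=^)[ \\t]+"            // strip-linestart-whitespace
            + "|[ \\t]+(?:$)"          // strip-lineend-whitespace
            + "|[ \\t](?=[ \\t])"      // strip-multiple-whitespace (keeps one space)
            + "|(\\n[ \\t]*)+(?=\\n)", // strip-blank-lines (keeps one newline)
            Pattern.MULTILINE);

    static String minify(String html) {
        // Every match is replaced with an empty string, just like the filter does.
        return ALL.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div>\n    <p>hello   world</p>   \n\n\n</div>";
        System.out.println(minify(html));
        // prints:
        // <div>
        // <p>hello world</p>
        // </div>
    }
}
```

<p>Leading indentation, trailing spaces and the blank lines all disappear, while single spaces between words survive.</p>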
<p><h3>Configuration</h3>
You simply put the filter into your classpath somewhere and wire it up in web.xml. You first define the filter reference and any parameters:</p>
<div class="highlight"><pre><span class="nt"><filter></span>
<span class="nt"><filter-name></span>htmlMinifyFilter<span class="nt"></filter-name></span>
<span class="nt"><filter-class></span>au.com.xprime.misc.webapp.filter.HTMLMinifyFilter<span class="nt"></filter-class></span>
<span class="nt"><init-param></span>
<span class="nt"><param-name></span>strip-linestart-whitespace<span class="nt"></param-name></span>
<span class="nt"><param-value></span>true<span class="nt"></param-value></span>
<span class="nt"></init-param></span>
<span class="nt"><init-param></span>
<span class="nt"><param-name></span>strip-lineend-whitespace<span class="nt"></param-name></span>
<span class="nt"><param-value></span>true<span class="nt"></param-value></span>
<span class="nt"></init-param></span>
<span class="nt"><init-param></span>
<span class="nt"><param-name></span>strip-multiple-whitespace<span class="nt"></param-name></span>
<span class="nt"><param-value></span>true<span class="nt"></param-value></span>
<span class="nt"></init-param></span>
<span class="nt"><init-param></span>
<span class="nt"><param-name></span>strip-blank-lines<span class="nt"></param-name></span>
<span class="nt"><param-value></span>true<span class="nt"></param-value></span>
<span class="nt"></init-param></span>
<span class="nt"></filter></span>
</pre></div>
<p>Any of the parameters can be set to <i>false</i> or omitted altogether to turn that part off.</p>
<p>Then you need to wire up the filter to any incoming URIs, which is done just like servlet-mapping (but still hopelessly unhelpful; why can't we have proper regular expressions for these?). You'll notice that I'm only using a Writer, so even though it checks for a text/html response before it does any rewriting, you won't want it touching any binary data because we don't wrap getOutputStream(). So either make sure the filter only gets applied to text/html URIs or modify the filter to be binary-safe. I only have a few URIs that I want to apply this to so I've put them in manually with one of these per URI:</p>
<div class="highlight"><pre><span class="nt"><filter-mapping></span>
<span class="nt"><filter-name></span>htmlMinifyFilter<span class="nt"></filter-name></span>
<span class="nt"><url-pattern></span>/myuri<span class="nt"></url-pattern></span>
<span class="nt"></filter-mapping></span>
</pre></div>
<p>But you can also do the simple url-pattern matching with <code>*.ext</code> or <code>/*</code>, etc.</p>
<p>And there you go! Cheap and easy HTML minification from within the Servlet container.</p>
Handling X-Forwarded-For in Java and Tomcat2011-07-30T04:49:50.000Zhttps://r.va.gg/2011/07/handling-x-forwarded-for-in-java-and-tomcat.html
<p>This is the first follow-up to my <a title="Wrangling the X-Forwarded-For Header" href="http://rod.vagg.org/2011/07/wrangling-the-x-forwarded-for-header/">post on X-Forwarded-For</a>, I'll assume you've at least scanned that article.</p>
<p><h3>Revision of the security issues</h3>
It's important to recap the security message of my previous post. <strong>Don't assume that the content of the X-Forwarded-For header is either correct or syntactically valid</strong>. The header is not hard to spoof and there are only certain situations where you may be able to trust parts of the content of the header.</p>
<p>So, my simple advice is not to use this header for anything <em>important</em>. Don't use it for authentication purposes or anything else that has security implications. It really should only be used for your own information purposes or to provide customised content for the user, where it's OK for that customisation to be based on false information, because that will always be a possibility.</p>
<p>We use it on <a href="http://feedxl.com/">FeedXL</a> for IP address geolocation using <a href="http://www.maxmind.com/app/country">GeoIP</a> to serve country specific information to visitors. Ultimately it doesn't really matter a whole lot if we get it wrong; while there are differences in the content, they aren't major. It may cause some confusion but that confusion can be resolved if the customer wants to contact us. You sign up to FeedXL based on your country, but we still let you select your country from a list even though we pre-select the one we guess from your IP address. And if you sign up to the wrong country you won't get access to the correct database for your country; hardly a major security issue, more of an inconvenience. If you're spoofing X-Forwarded-For then you're probably not the kind of person who's going to get confused by the content anyway; you're probably just poking around and not really interested in our product!</p>
<p><h3>Extracting a useful IP address</h3>
I ended my last post with a generalised rule for extracting the most likely useful IP address from the X-Forwarded-For header:</p>
<p><blockquote><strong><em>Always use the leftmost non-private address</em></strong>.</blockquote>
And I gave a couple of regular expressions to help with this process: <code>([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})</code> or <code>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</code> to match an IP address, and <code>(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)</code> to match a private IP address.</p>
<p><h3>Java use cases</h3>
In my Java code I have 2 uses for the IP address from X-Forwarded-For, both of these come up because we're working behind a load balancer (Amazon's <a href="http://aws.amazon.com/elasticloadbalancing/">Elastic Load Balancing</a>) and don't have direct access to the remote host information:</p>
<p><ul>
<li>Looking up the country information in the GeoIP database using their Java API. Most of our use of GeoIP is with <a href="http://geolite.maxmind.com/download/geoip/api/mod_geoip2/">mod_geoip</a> in <a href="http://httpd.apache.org/">Apache </a>but we also want to occasionally use it from within a <a href="http://www.oracle.com/technetwork/java/javaee/servlet/index.html">servlet</a>. For example, on our sign-up page we pre-select the country at the top of the page based on your IP address, this is done within Java.</li>
<li>More interesting logging from <a href="http://tomcat.apache.org/">Tomcat</a>: if I want to have <a href="http://tomcat.apache.org/tomcat-6.0-doc/config/valve.html#Access_Log_Valve">AccessLogValve</a> turned on, the host information isn't very interesting behind a load balancer.</li>
</ul>
A generic parser would serve both of these purposes!</p>
<p><h3>Parsing X-Forwarded-For</h3>
I have created a simple utility class to do the parsing, called from wherever I need either an <strong>IP address</strong> or a <strong>hostname</strong>.</p>
<div class="highlight"><pre><span class="kn">package</span> <span class="n">au</span><span class="o">.</span><span class="na">com</span><span class="o">.</span><span class="na">xprime</span><span class="o">.</span><span class="na">webapp</span><span class="o">.</span><span class="na">util</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.net.Inet4Address</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.net.InetAddress</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.net.UnknownHostException</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.regex.Matcher</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.regex.Pattern</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">javax.servlet.http.HttpServletRequest</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">InetAddressUtil</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">String</span> <span class="n">IP_ADDRESS_REGEX</span> <span class="o">=</span> <span class="s">"([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})"</span><span class="o">;</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">String</span> <span class="n">PRIVATE_IP_ADDRESS_REGEX</span> <span class="o">=</span> <span class="s">"(^127\\.0\\.0\\.1)|(^10\\.)|(^172\\.1[6-9]\\.)|(^172\\.2[0-9]\\.)|(^172\\.3[0-1]\\.)|(^192\\.168\\.)"</span><span class="o">;</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="n">Pattern</span> <span class="n">IP_ADDRESS_PATTERN</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="n">Pattern</span> <span class="n">PRIVATE_IP_ADDRESS_PATTERN</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">findNonPrivateIpAddress</span><span class="o">(</span><span class="n">String</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">IP_ADDRESS_PATTERN</span> <span class="o">==</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
<span class="n">IP_ADDRESS_PATTERN</span> <span class="o">=</span> <span class="n">Pattern</span><span class="o">.</span><span class="na">compile</span><span class="o">(</span><span class="n">IP_ADDRESS_REGEX</span><span class="o">);</span>
<span class="n">PRIVATE_IP_ADDRESS_PATTERN</span> <span class="o">=</span> <span class="n">Pattern</span><span class="o">.</span><span class="na">compile</span><span class="o">(</span><span class="n">PRIVATE_IP_ADDRESS_REGEX</span><span class="o">);</span>
<span class="o">}</span>
<span class="n">Matcher</span> <span class="n">matcher</span> <span class="o">=</span> <span class="n">IP_ADDRESS_PATTERN</span><span class="o">.</span><span class="na">matcher</span><span class="o">(</span><span class="n">s</span><span class="o">);</span>
<span class="k">while</span> <span class="o">(</span><span class="n">matcher</span><span class="o">.</span><span class="na">find</span><span class="o">())</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(!</span><span class="n">PRIVATE_IP_ADDRESS_PATTERN</span><span class="o">.</span><span class="na">matcher</span><span class="o">(</span><span class="n">matcher</span><span class="o">.</span><span class="na">group</span><span class="o">(</span><span class="mi">0</span><span class="o">)).</span><span class="na">find</span><span class="o">())</span>
<span class="k">return</span> <span class="n">matcher</span><span class="o">.</span><span class="na">group</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
<span class="n">matcher</span><span class="o">.</span><span class="na">region</span><span class="o">(</span><span class="n">matcher</span><span class="o">.</span><span class="na">end</span><span class="o">(),</span> <span class="n">s</span><span class="o">.</span><span class="na">length</span><span class="o">());</span>
<span class="o">}</span>
<span class="k">return</span> <span class="kc">null</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">getAddressFromRequest</span><span class="o">(</span><span class="n">HttpServletRequest</span> <span class="n">request</span><span class="o">)</span> <span class="o">{</span>
<span class="n">String</span> <span class="n">forwardedFor</span> <span class="o">=</span> <span class="n">request</span><span class="o">.</span><span class="na">getHeader</span><span class="o">(</span><span class="s">"X-Forwarded-For"</span><span class="o">);</span>
<span class="k">if</span> <span class="o">(</span><span class="n">forwardedFor</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">&&</span> <span class="o">(</span><span class="n">forwardedFor</span> <span class="o">=</span> <span class="n">findNonPrivateIpAddress</span><span class="o">(</span><span class="n">forwardedFor</span><span class="o">))</span> <span class="o">!=</span> <span class="kc">null</span><span class="o">)</span>
<span class="k">return</span> <span class="n">forwardedFor</span><span class="o">;</span>
<span class="k">return</span> <span class="n">request</span><span class="o">.</span><span class="na">getRemoteAddr</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">getHostnameFromRequest</span><span class="o">(</span><span class="n">HttpServletRequest</span> <span class="n">request</span><span class="o">)</span> <span class="o">{</span>
<span class="n">String</span> <span class="n">addr</span> <span class="o">=</span> <span class="n">getAddressFromRequest</span><span class="o">(</span><span class="n">request</span><span class="o">);</span>
<span class="k">try</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">Inet4Address</span><span class="o">.</span><span class="na">getByName</span><span class="o">(</span><span class="n">addr</span><span class="o">).</span><span class="na">getHostName</span><span class="o">();</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="n">Exception</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// resolution failed; fall through and return the raw address</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">addr</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="n">InetAddress</span> <span class="nf">getInet4AddressFromRequest</span><span class="o">(</span><span class="n">HttpServletRequest</span> <span class="n">request</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">UnknownHostException</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">Inet4Address</span><span class="o">.</span><span class="na">getByName</span><span class="o">(</span><span class="n">getAddressFromRequest</span><span class="o">(</span><span class="n">request</span><span class="o">));</span>
<span class="o">}</span>
<span class="o">}</span>
</pre></div>
<p><em>(Download <a href="http://src.vagg.org/java/InetAddressUtil.java">here</a>)</em></p>
<p>Given an <code>HttpServletRequest</code> we can call either <code>getAddressFromRequest()</code> or <code>getHostnameFromRequest()</code> to get the data we need.</p>
<p>We first apply the general IP address regular expression, then in <code>findNonPrivateIpAddress()</code> we loop through each match, working from the beginning (left) of the string. This way we never have to look at the commas in the string and don't care whether there are any spaces. We also get to skip over any nonsense data that may be in the string: if you spoof the header with a random string of characters then it'll simply be ignored. The code is quite strict in that it'll only accept non-private IP addresses from the header; otherwise it resorts to the remote address of the request as a fall-back.</p>
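<p>To see the matching behaviour in isolation, here's that loop extracted into a self-contained sketch. The private-range pattern is copied from the class above; the general dotted-quad pattern is an assumption mirroring the simple regex described elsewhere in these posts (the <code>IP_ADDRESS_REGEX</code> constant itself isn't shown in the snippet):</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XffSketch {
    // Assumed general dotted-quad pattern (the class's IP_ADDRESS_REGEX isn't shown above)
    static final Pattern IP = Pattern.compile(
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})");
    // Private-range pattern copied from InetAddressUtil above
    static final Pattern PRIVATE_IP = Pattern.compile(
            "(^127\\.0\\.0\\.1)|(^10\\.)|(^172\\.1[6-9]\\.)|(^172\\.2[0-9]\\.)|(^172\\.3[0-1]\\.)|(^192\\.168\\.)");

    // Same loop as findNonPrivateIpAddress(): return the first match, left to
    // right, that looks like an IP address but isn't a private one
    static String findNonPrivateIpAddress(String s) {
        Matcher matcher = IP.matcher(s);
        while (matcher.find()) {
            if (!PRIVATE_IP.matcher(matcher.group(0)).find())
                return matcher.group(0);
        }
        return null;
    }

    public static void main(String[] args) {
        // Private client address first, so the proxy's public address wins
        System.out.println(findNonPrivateIpAddress("10.208.4.38, 58.163.175.187")); // 58.163.175.187
        // Spoofed junk yields no match at all
        System.out.println(findNonPrivateIpAddress("completely bogus header value")); // null
    }
}
```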
<p>Our hostname resolution is also prepared for failure and will return the original IP address if it can't get you a hostname.</p>
<p>Instead of calling <code>request.getRemoteAddr()</code> and <code>request.getRemoteHost()</code> directly from our own code, we simply wrap them with <code>InetAddressUtil.getAddressFromRequest(request)</code> and <code>InetAddressUtil.getHostnameFromRequest(request)</code>.</p>
<h3>Extending Tomcat logging</h3>
<p>You enable request logging in Tomcat by attaching an AccessLogValve to your context or host. It mirrors the custom formatting options that you'll find in <a href="http://httpd.apache.org/docs/2.0/mod/mod_log_config.html">Apache's CustomLog</a>. So you can print a <strong>%h</strong> for the request hostname, but behind a load balancer you'll just get the name or address of the load balancer that's forwarding the request. You could also use <strong>%{X-Forwarded-For}i</strong> to get access to the raw header value, but this will be either a single IP address or a comma-separated string of IP addresses. That may be useful for your purposes but not mine; I want a hostname!</p>
<p>Unfortunately, AccessLogValve doesn't lend itself to easy extension: there are two <code>createAccessLogElement()</code> methods that you'd ideally be able to override in your own subclass, returning a new custom <code>AccessLogElement</code> for the character you've chosen to represent your log element.</p>
<p>The best we can do is override the protected <code>createLogElements()</code>, copying the functionality from there and extending it with our own. However, in my extension of AccessLogValve I've assumed that the Tomcat boys will eventually <a href="https://issues.apache.org/bugzilla/show_bug.cgi?id=51588">fix</a> the access modifiers for the <code>createAccessLogElement()</code> methods, so I've just copied the whole class, named it <code>AccessLogValve_</code> and changed the modifiers myself. The plan is to remove this in the future and take the _ off the extended class name in my code.</p>
<p>Here's my extended AccessLogValve:</p>
<div class="highlight"><pre><span class="kn">package</span> <span class="n">au</span><span class="o">.</span><span class="na">com</span><span class="o">.</span><span class="na">xprime</span><span class="o">.</span><span class="na">catalina</span><span class="o">.</span><span class="na">valves</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.Date</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.catalina.connector.Request</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">org.apache.catalina.connector.Response</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">au.com.xprime.webapp.util.InetAddressUtil</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">AccessLogValve</span> <span class="kd">extends</span> <span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">catalina</span><span class="o">.</span><span class="na">valves</span><span class="o">.</span><span class="na">AccessLogValve_</span> <span class="o">{</span>
<span class="kd">protected</span> <span class="kd">class</span> <span class="nc">ForwardedForAddrElement</span> <span class="kd">implements</span> <span class="n">AccessLogElement</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">addElement</span><span class="o">(</span><span class="n">StringBuffer</span> <span class="n">buf</span><span class="o">,</span> <span class="n">Date</span> <span class="n">date</span><span class="o">,</span> <span class="n">Request</span> <span class="n">request</span><span class="o">,</span> <span class="n">Response</span> <span class="n">response</span><span class="o">,</span> <span class="kt">long</span> <span class="n">time</span><span class="o">)</span> <span class="o">{</span>
<span class="n">buf</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">InetAddressUtil</span><span class="o">.</span><span class="na">getAddressFromRequest</span><span class="o">(</span><span class="n">request</span><span class="o">));</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">protected</span> <span class="kd">class</span> <span class="nc">ForwardedForHostElement</span> <span class="kd">extends</span> <span class="n">ForwardedForAddrElement</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">addElement</span><span class="o">(</span><span class="n">StringBuffer</span> <span class="n">buf</span><span class="o">,</span> <span class="n">Date</span> <span class="n">date</span><span class="o">,</span> <span class="n">Request</span> <span class="n">request</span><span class="o">,</span> <span class="n">Response</span> <span class="n">response</span><span class="o">,</span> <span class="kt">long</span> <span class="n">time</span><span class="o">)</span> <span class="o">{</span>
<span class="n">buf</span><span class="o">.</span><span class="na">append</span><span class="o">(</span><span class="n">InetAddressUtil</span><span class="o">.</span><span class="na">getHostnameFromRequest</span><span class="o">(</span><span class="n">request</span><span class="o">));</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">protected</span> <span class="n">AccessLogElement</span> <span class="nf">createAccessLogElement</span><span class="o">(</span><span class="kt">char</span> <span class="n">pattern</span><span class="o">)</span> <span class="o">{</span>
<span class="n">AccessLogElement</span> <span class="n">accessLogElement</span> <span class="o">=</span> <span class="kd">super</span><span class="o">.</span><span class="na">createAccessLogElement</span><span class="o">(</span><span class="n">pattern</span><span class="o">);</span>
<span class="k">if</span> <span class="o">(</span><span class="n">accessLogElement</span> <span class="k">instanceof</span> <span class="n">StringElement</span><span class="o">)</span> <span class="o">{</span>
<span class="k">switch</span> <span class="o">(</span><span class="n">pattern</span><span class="o">)</span> <span class="o">{</span>
<span class="k">case</span> <span class="sc">'f'</span> <span class="o">:</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">ForwardedForAddrElement</span><span class="o">();</span>
<span class="k">case</span> <span class="sc">'F'</span> <span class="o">:</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">ForwardedForHostElement</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">accessLogElement</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span>
</pre></div>
<p><em>(Download <a href="http://src.vagg.org/java/AccessLogValve.java">here</a> and AccessLogValve_ <a href="http://src.vagg.org/java/AccessLogValve_.java">here</a>)</em></p>
<p>Which gives me <strong>%f</strong> for the X-Forwarded-For IP address and <strong>%F</strong> for the X-Forwarded-For hostname. My valve pattern looks like this:</p>
<p><code style="padding-left: 30px;">pattern="%F %f %h %l %u %t &quot;%r&quot; %s %b &quot;%{Referer}i&quot; &quot;%{User-Agent}i&quot;"</code></p>
<p>Simply compile, place together in a JAR, put it in your Tomcat lib directory then make sure you use the right class name when building your AccessLogValve descriptor. The lazy can find a JAR (including source) <a href="http://src.vagg.org/java/xprime_accesslogvalve.jar">here</a>.</p>
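<p>As a sketch of what that descriptor might look like, here's a Valve entry for context.xml (or inside a Host in server.xml). The className matches the package declared in the source above; the directory/prefix/suffix values are purely illustrative:</p>

```xml
<!-- in context.xml, or inside a <Host> element in server.xml -->
<Valve className="au.com.xprime.catalina.valves.AccessLogValve"
       directory="logs" prefix="access_log." suffix=".txt"
       pattern="%F %f %h %l %u %t &quot;%r&quot; %s %b &quot;%{Referer}i&quot; &quot;%{User-Agent}i&quot;"/>
```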
<p>Next I'll be getting dirty with C and hack mod_geoip to do something similar.</p>
Wrangling the X-Forwarded-For Header2011-07-29T06:15:36.000Zhttps://r.va.gg/2011/07/wrangling-the-x-forwarded-for-header.html
<p>Until recently, we've served pages directly from the server for <a title="FeedXL D. I. Y. Horse Nutrition" href="http://feedxl.com/">FeedXL.com</a> but we've since moved to a load balancing situation with multiple servers behind a load balancer.</p>
<h3>AWS &amp; ELB</h3>
<p>We use <a title="Amazon Web Services" href="http://aws.amazon.com">Amazon Web Services</a> to host FeedXL and are now using their <strong><a title="AWS Elastic Load Balancing" href="http://aws.amazon.com/elasticloadbalancing/">Elastic Load Balancing</a> (ELB)</strong> service to spread the load across 3 <em>Availability Zones</em> in the main datacentre we operate from. We're doing this primarily for high availability rather than to handle heavy load, but the added benefit is that it lets us scale up really easily if we see any sudden spikes in traffic. We're using some small instances at the front running <a href="http://httpd.apache.org/">Apache</a> to handle the main traffic. The dynamic content is passed on to larger back-end instances running our webapp in <a href="http://tomcat.apache.org/">Tomcat</a>.</p>
<p>A couple of our important <a href="http://aws.amazon.com/ebs/">EBS</a> volumes were among the last to be restored during <a href="http://alestic.com/2011/04/ec2-outage">Judgement Day</a>, April 2011, and while we had regular snapshots we hesitated for too long before rebuilding our service in a different Availability Zone (or Region), partly because of the lack of clear information about the outage from Amazon (we were continually given the impression that things would be back online soon, so why not wait just a tiny bit longer for normal service rather than restore from slightly older snapshots?). Probably like many AWS customers impacted by the outage, we've increased our spend to boost our redundancy and better handle outages of this kind. We now span multiple Availability Zones and have increased the quality of our off-Region backups. I'm pretty sure that in the end Amazon has done very well out of their rather embarrassing incident, with many customers keen to avoid their own embarrassment the next time it happens.</p>
<p>However, switching to ELB hasn't been without hiccups.</p>
<h3>GeoIP</h3>
<p>We rely very heavily on <a href="http://www.maxmind.com/app/country">GeoIP</a> from MaxMind to serve content customised to each country. We have a large amount of functionality built right into our Apache configuration that uses both <a href="http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html">rewrites</a> and <a href="http://httpd.apache.org/docs/2.0/mod/mod_include.html">SSI</a> to make our static content relatively dynamic. We even do spelling correction for UK/US English depending on where you view our site from! The main reason we customise content, though, is because FeedXL is a different product for each country. We have to maintain country-specific feeds databases and we also mostly deal with local currencies, so our price details change a little depending on where you are. We've had a very good experience with GeoIP, with only a few mismatches reported by customers, and they've always been corporate networks where traffic is routed internationally (Australia->USA or NZ->AU for example) or satellite connections without a likely country of origin.</p>
<p>The way that <a href="http://geolite.maxmind.com/download/geoip/api/mod_geoip2/">mod_geoip</a> for Apache works is that it takes the request IP address and looks it up in its database to find the (most likely) country of origin; you then get environment variables in your Apache request: GEOIP_COUNTRY_CODE &amp; GEOIP_COUNTRY_NAME. You can use these with mod_rewrite to do all sorts of crazy things, plus mod_include lets you do more straightforward things with your content. For example, if we want to make a North America specific announcement we might wrap our announcement block in <code>&lt;!--#if expr='"$GEOIP_COUNTRY_CODE" = "US" || "$GEOIP_COUNTRY_CODE" = "CA"' --&gt; <em>... content ...</em> &lt;!--#endif --&gt;</code>.</p>
<p>However, one of the most important catches of load balancing is that your requests come to your web server from the load balancer itself and not the original client, so you don't get the raw IP address of the client built into your request. Instead, with ELB and most other load balancers you need to use the <a href="http://en.wikipedia.org/wiki/X-Forwarded-For"><strong>X-Forwarded-For</strong> </a>HTTP header.</p>
<h3>X-Forwarded-For</h3>
<p>The X-Forwarded-For header was first introduced by <a href="http://www.squid-cache.org/">Squid</a> as a means of passing on the IP address of the client to the server. It has since been widely adopted by other proxy servers and load balancers, so it's pretty much considered a <em>standard</em> even if it technically isn't one.</p>
<p>What you are supposed to get as your header is this:</p>
<pre><code>X-Forwarded-For: clientIP, server1IP, server2IP, server3IP
</code></pre><p>The client IP address should be first, followed by the first proxy server, followed by any other servers, in a comma-separated list. The final server that passes the request on to you won't be in the list: <span style="text-decoration: underline;">a proxy server or load balancer will only append the address of the server it received the request from if the X-Forwarded-For header was passed to it</span>; otherwise it just constructs a new X-Forwarded-For with just the client address in it. The address of the last server in the complete <em>chain</em> is simply the address of the client making the request to your server. But, as usual in the web world, there are no guarantees.</p>
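<p>The hop-by-hop rule underlined above can be sketched in a few lines. This is a toy simulation, not any real proxy's code; the addresses are illustrative documentation-range values:</p>

```java
// Each hop appends the address it received the request from, but only if an
// X-Forwarded-For header was already present; otherwise it starts a new
// header containing just the client address.
public class XffChain {
    static String forward(String incomingXff, String receivedFromAddr) {
        return incomingXff == null ? receivedFromAddr
                                   : incomingXff + ", " + receivedFromAddr;
    }

    public static void main(String[] args) {
        // client (203.0.113.7) -> proxy: proxy starts a fresh header
        String atProxy = forward(null, "203.0.113.7");
        // proxy (198.51.100.2) -> load balancer: proxy's address is appended;
        // the load balancer itself never appears in the list the server sees
        String atOrigin = forward(atProxy, "198.51.100.2");
        System.out.println("X-Forwarded-For: " + atOrigin); // client first, proxy second
    }
}
```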
<p>Apache kindly gives you an HTTP_X_FORWARDED_FOR environment variable (although I can't find official documentation on this so I'm not sure of the specifics of what conditions may prevent you from getting this variable). You could use this in custom modules or standard modules that use environment variables such as mod_rewrite. If you want to log with it then you could configure your <code>LogFormat</code> to print it out with <code>%{X-Forwarded-For}i</code> to make your logs more interesting than just showing the load balancer hostname as <code>%h</code>.</p>
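<p>As a sketch, a LogFormat along the lines of Apache's standard "combined" format with the raw header value prepended might look like this (the nickname is arbitrary):</p>

```apache
# "combined" format with the raw X-Forwarded-For value prepended
LogFormat "%{X-Forwarded-For}i %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" xff_combined
CustomLog logs/access_log xff_combined
```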
<p>mod_geoip has a configuration switch, <code>GeoIPScanProxyHeaders On</code> that tells it to use X-Forwarded-For (or HTTP_X_FORWARDED_FOR) to determine the client IP address rather than just the remote address.</p>
<p>There are some important catches to consider before you proceed to use this header to do anything interesting:</p>
<ol>
<li>Most importantly, headers can be crafted by anyone: <strong>never trust a header value unless you are certain that it can't be spoofed</strong>. I'd actually simplify that to just <em>never trust a header value</em>. So if you are going to use it, don't use it for anything that has security implications.</li>
<li>The client IP address that you get from the first entry may not actually be the address that you want. Most of the time requests will probably come directly from the browser of your visitor, but what if they are behind a proxy server within a private network themselves? The IP address you end up with could be something like 10.1.34.121, which is of no value because it only tells you that they are sitting on a private network <em>somewhere</em> in the world.</li>
</ol>
<h3>Security Implications</h3>
<p>This is pretty straightforward. If you're handling traffic behind a load balancer then you may be able to guarantee that your traffic comes from the load balancer, so the header is constructed by it; but consider the situation where X-Forwarded-For contains a chain of addresses, potentially from untrusted sources. If the header contains at least one <em>server</em> IP address then the <em>client</em> IP address will have been passed on by the upstream server with no way for your load balancer to verify its correctness; all it's doing is adding the address of the requesting host onto the end of the list.</p>
<p>There's also the possibility of direct connections to your web server(s). Are your servers walled off from the outside world, with only the load balancer able to communicate with them? Is there a possibility that a client can make a direct connection to your server and construct its own X-Forwarded-For header? On AWS, all standard instances have a public IP address, but you can set up your <a href="http://docs.amazonwebservices.com/AWSEC2/2007-08-29/DeveloperGuide/distributed-firewall-concepts.html">security groups</a> to only allow access to port 80 from your load balancer. This is probably a good idea for many reasons.</p>
<p>Basically, I would suggest working on the assumption that X-Forwarded-For is only <em>likely</em> to be correct, nothing more.</p>
<h3>Best Guess IP Address</h3>
<p>When using X-Forwarded-For, the assumption normally made is that the first IP address in the list is the client address that you can use to do interesting things with, like IP address geolocation (à la GeoIP). But what about <a href="http://en.wikipedia.org/wiki/Private_network">private addresses</a>? What about the casual browser at McDonalds using their WiFi with a 10.x.x.x address, or a company network with a 192.168.x.x internal address structure? You'll end up with a very unhelpful address that tells you nothing interesting about the client.</p>
<p>There are 3 sets of address ranges in <a href="http://en.wikipedia.org/wiki/IPv4">IPv4</a> (let's ignore <a href="http://en.wikipedia.org/wiki/IPv6">IPv6</a> for now) that are reserved for private networks. Normally these are hidden behind <a href="http://en.wikipedia.org/wiki/Network_address_translation">NAT</a> gateways, and often traffic is forced, either manually or automatically, to route through a proxy server of some kind. The address ranges are:</p>
<ul>
<li>10.0.0.0 – 10.255.255.255</li>
<li>172.16.0.0 – 172.31.255.255</li>
<li>192.168.0.0 – 192.168.255.255</li>
</ul>
<p>You can thank these beauties for extending the life of IPv4 way beyond what it would otherwise have been.</p>
<p>If you have a client behind one of these networks and it's not routed through a proxy server then you'll probably just get the IP address of the NAT gateway which is likely to be the address you want to use. If the request is routed through a proxy server then you may get an X-Forwarded-For that looks something like this:</p>
<pre><code>X-Forwarded-For: 10.208.4.38, 58.163.175.187
</code></pre><p>Where the address you probably want is actually the (proxy) server address on the end rather than the private client address.</p>
<p>You may also have a chain of multiple servers, perhaps you have a downstream proxy server going through a larger upstream one before heading out of the network, so you may get something like this:</p>
<pre><code>X-Forwarded-For: 10.208.4.38, 58.163.1.4, 58.163.175.187
</code></pre><p>Or, the downstream proxy server could be within the private network, perhaps a departmental proxy server connecting to a company-wide proxy server and then this may happen:</p>
<pre><code>X-Forwarded-For: 10.208.4.38, 10.10.30.23, 58.163.175.187
</code></pre><p>This could of course be even more complex as you may have a longer chain of proxy servers (although I've never actually seen anyone chain more than 2 layers of proxy servers together in a network before).</p>
<p>So what general rule should we construct for extracting our usable client IP from these addresses?</p>
<p>Of course, I'm suggesting that the rule <strong><em>always use the leftmost address</em></strong> is not correct, as there is a good chance it may be a private IP address if there is more than 1 address in the list. Unfortunately this is the rule that mod_geoip adopts: if it finds a comma it just chops the string off at that comma. We immediately found this led to unsatisfactory results with ELB, as we had more requests than we expected originating from private networks routed through proxy servers; and we heard about it in the form of error reports from our users (<em>"where's the log in link?"</em>--it's not normally displayed in countries where we haven't released FeedXL).</p>
<p>An alternative would be <strong><em>always use the rightmost address</em></strong>, which would probably get you a pretty good guess in almost all cases. If there is more than one IP address in the list then the rightmost address will probably be the address where the request left whatever corporate or internal network the client was hidden behind, even if there are multiple layers. However, multiple layers of IP addresses suggest a fairly large network, possibly widely dispersed. There's also a chance that you have one proxy server piggybacking off a higher-capacity upstream proxy server: for example, some ISPs run their own very large proxy servers that customers can use, and these may make ideal upstream connections for internal proxy servers, with caching at both levels. The ISP proxy server is likely to be located in a very different place to the client though, and if you're trying to pin down the IP address of the client using something like <a href="http://www.maxmind.com/app/city">GeoIP City</a> then you'll probably get the wrong city.</p>
<p>So, here's the rule that I suggest would be the best general case rule to allow you to extract the address most likely to be physically close to the real client:</p>
<p style="padding-left: 30px;"><strong><em>Always use the leftmost <span style="text-decoration: underline;">non-private</span> address</em></strong>.</p>
<p>We can do this because the rules are clear about what is and what is not a private IP address (see above).</p>
<h3>Doing It the Regular Expression Way</h3>
<p>First, remember that the X-Forwarded-For header is not very trustworthy. You don't even want to assume that it contains IP addresses! So, before you check whether an entry is a private IP address or not, you should first check that it's an IP address at all.</p>
<p>Here's a simple regular expression to match an IP address: <code>([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})</code> or alternatively, if you're working in an environment that supports \d, this will do the same thing: <code>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</code> (with or without the parentheses, but as you'll see, they are useful for the next step).</p>
<p>Then you'll want to check if an IP address is private or not; here's a regular expression that'll do that for you, given a valid IP address: <code>(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)</code>. This matches all of the address ranges listed above, plus 127.0.0.1 as a bonus (quite possible in our chain!).</p>
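<p>You can sanity-check that pattern against the ranges above with a few lines of Java (the class and method names here are just for illustration; note the dots need a second level of escaping in a Java string literal):</p>

```java
import java.util.regex.Pattern;

public class PrivateIpCheck {
    // The private-range regex from the paragraph above, escaped for Java
    static final Pattern PRIVATE = Pattern.compile(
            "(^127\\.0\\.0\\.1)|(^10\\.)|(^172\\.1[6-9]\\.)|(^172\\.2[0-9]\\.)|(^172\\.3[0-1]\\.)|(^192\\.168\\.)");

    static boolean isPrivate(String ip) {
        return PRIVATE.matcher(ip).find();
    }

    public static void main(String[] args) {
        System.out.println(isPrivate("10.208.4.38"));     // true
        System.out.println(isPrivate("172.31.255.255"));  // true, top of the 172.16-31 range
        System.out.println(isPrivate("172.32.0.1"));      // false, just outside it
        System.out.println(isPrivate("58.163.175.187"));  // false
    }
}
```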
<p>So a general algorithm could be something like this: walk through the string starting from the first match of our general IP address regular expression through to the last. For each match, check if the matched component matches our private IP address regular expression, if it does then proceed to the next address in the list, if it doesn't match then we have the IP address we want. If we get to the end of the list without finding an IP address that isn't private then we have to have some kind of generic fall-back.</p>
<p>Exactly what your fall-back might be depends on your environment and whether your trust the server passing you the request or not. In the case of ELB, if it's working properly we should never need the fall-back case. For FeedXL our fall-back for any failure during the GeoIP process is to just assume that they are coming from the country where most of our customers are from (currently Australia).</p>
<p>I have 2 follow-up posts to make after this one, first I'll show how I deal with X-Forwarded-For in both Tomcat and our own Java software, then I'll show how I've hacked mod_geoip to use the algorithm outlined above with excellent results.</p>
<p><em><strong>Follow-up #1: <a title="Permanent Link to Handling X-Forwarded-For in Java and Tomcat" href="http://rod.vagg.org/2011/07/handling-x-forwarded-for-in-java-and-tomcat/" rel="bookmark">Handling X-Forwarded-For in Java and Tomcat</a></strong></em></p>
<p><strong><em>Follow-up #2: <a title="A mod_geoip2 that properly handles X-Forwarded-For" href="http://rod.vagg.org/2012/04/a-mod_geoip2-that-properly-handles-x-forwarded-for/">A mod_geoip2 that properly handles X-Forwarded-For</a></em></strong></p>
<h3>Update July 30th 2011</h3>
<p>I've just stumbled upon <a title="X-Forwarded-For Spoofer" href="https://addons.mozilla.org/en-US/firefox/addon/x-forwarded-for-spoofer/">this</a>, an "X-Forwarded-For Spoofer" add-on for Firefox, and I love the description; it sums up the security concerns:</p>
<blockquote><p><em>Some clients add X-Forwarded-For to HTTP requests in an attempt to help servers identify the originating IP address of a request. Some clients, however, can set X-Forwarded-For to any arbitrary value. Some servers assume X-Forwarded-For is unassailable. No server should.</em></p>
<p><em>With this add-on, you can assign an arbitrary IP address to the X-Forwarded-For field, attempt to perform XSS by including HTML in this field, or even attempt SQL injection.</em></p></blockquote>
<p>May be useful for testing and debugging your web application.</p>