Isilon – The Challenge of Files at Scale

I get a lot of requests to post about Isilon, so I hooked up with Ron Steinke, a Technical Staff member in Isilon Software Engineering, to write some guest posts. I would really appreciate your feedback, including whether you would like us to write more about Isilon.

Scalability is a harder problem for stateful, feature rich solutions. Distributed filesystems are a prime example of this, as coordinating namespace and metadata updates between multiple head nodes presents a challenge not found in block or object storage.

The key is to remember that this challenge must be viewed in the context of the simplification it brings to application development. Application developers choose to use files to simplify the development process, with less concern about what this means for the ultimate deployment of the application at scale. For a process limited by the application development lifecycle, or dependent on third party applications, the tradeoff of utilizing a more complex storage solution is often the right one.

Part of the challenge of file scalability is fully replicating the typical file environment. Any scalability solution that imposes restrictions that aren’t present in the development environment is likely to run up against assumptions built into applications. This leads to major headaches, and the burden of solving them usually lands on the storage administrator. A few of the common workarounds for a scalable flat file namespace illustrate these kinds of limitations.

One approach is to have a single node in the storage cluster managing the namespace, with scalability only for file data storage. While this approach may provide some scalability in other kinds of storage, it’s fairly easy to saturate the namespace node with a file workload.

A good example of this approach is the default Apache HDFS implementation. While the data is distributed across many nodes, all namespace work (file creation, deletion, rename) is done by a single name node. This is great if you want to read through the contents of a large subset of your data, perform analytics, and aggregate the statistics. It’s less great if your workload is creating a lot of files and moving them around.
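To make the saturation point concrete, here is a toy capacity model. The per-node rates are made-up placeholder numbers, not measurements of HDFS or any other product, and the linear scaling is an idealization. The only point is that data throughput grows with node count while namespace capacity stays flat when one node owns the namespace.

```python
def cluster_capacity(nodes, data_mb_per_node, meta_ops_per_node, single_name_node):
    """Return (data MB/s, namespace ops/s) for a toy cluster.

    All rates are hypothetical inputs; scaling is assumed to be perfectly linear.
    """
    data = nodes * data_mb_per_node
    # With a single name node, only one node serves create/delete/rename.
    meta_nodes = 1 if single_name_node else nodes
    return data, meta_nodes * meta_ops_per_node


for n in (4, 16, 64):
    single = cluster_capacity(n, 500, 10_000, single_name_node=True)
    peers = cluster_capacity(n, 500, 10_000, single_name_node=False)
    print(n, single, peers)
```

In this model the second number in the single-name-node result never changes, no matter how many nodes are added.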

Another approach is namespace aggregation, where different parts of the storage array service different parts of the filesystem. This is effectively taking UNIX mount points to their logical conclusion. While this is mostly transparent to applications, it requires administrators to predict how much storage space each individual mount point will require. With dozens or hundreds of individual mount points, this quickly becomes a massive administration headache.
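A small sketch of the capacity-prediction problem, using invented mount point names and free-space figures: a write can fail on one mount point even though the array as a whole has plenty of room. The two helper functions are hypothetical and only model the accounting, not any real storage API.

```python
def can_write(free_gb_by_mount, mount, size_gb):
    """Aggregated namespace: a write must fit inside its own mount point."""
    return free_gb_by_mount[mount] >= size_gb


def can_write_pooled(free_gb_by_mount, size_gb):
    """Flat allocation: a write only has to fit in the shared pool."""
    return sum(free_gb_by_mount.values()) >= size_gb


# Hypothetical free space per mount point, in GB.
free_gb = {"/proj/genomics": 20, "/proj/render": 900, "/home": 400}

print(can_write(free_gb, "/proj/genomics", 100))  # False: this mount is nearly full
print(can_write_pooled(free_gb, 100))             # True: 1320 GB free overall
```

The gap between the two answers is space that has been paid for but cannot be used without moving a boundary.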

Worse is what happens when you want to reorganize your storage. The storage allocations that were originally made typically reflect the team structure of the organization at the time the storage was purchased. Organizations being what they are, the human structure is going to change. Changing the data within a single mount point involves renaming a few directories. Changes across mount points, or the creation of new mount points, involve data rewrites that will take longer and longer as the scale of your data grows.
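The difference between the two kinds of change is visible from any POSIX client. In Python, `os.rename` is a metadata-only operation and fails with `errno.EXDEV` when source and destination are on different filesystems, at which point the data has to be copied. The `relocate` helper below is a hypothetical name used for illustration; it simply reports which path was taken.

```python
import errno
import os
import shutil
import tempfile


def relocate(src, dst):
    """Move src to dst and report whether it was a cheap rename or a data copy."""
    try:
        os.rename(src, dst)          # metadata-only when both paths share a filesystem
        return "rename"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise                    # some other failure; do not hide it
        shutil.move(src, dst)        # crosses a mount point: copy the data, then delete
        return "copy"


with tempfile.TemporaryDirectory() as root:
    old = os.path.join(root, "team_a")
    new = os.path.join(root, "team_b")
    os.mkdir(old)
    # Both paths live under the same temporary directory, so this is a rename.
    print(relocate(old, new))
```

The rename cost does not depend on how much data sits under the directory. The copy path does, which is why cross-mount reorganizations get slower as the data grows.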

Clearly these approaches will work for certain kinds of workflows. Sadly, most storage administrators don’t have control of their users’ workflows, or even good documentation of what those workflows will be. The combination of arbitrary workflows and future scaling requirements ultimately pushes many organizations away from limited solutions.

The alternative is a scale-out filesystem, which looks like a single machine from both the users’ and the administrators’ perspective. A scale-out system isolates the logical layout of the filesystem namespace from the physical layout of where the data is stored. All nodes in a scale-out system are peers, avoiding special roles that may make a particular node a choke point. This parallel architecture also allows each scale-out cluster to grow to meet the users’ needs, allowing storage sizes far larger than any other filesystem platform.
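The separation of logical and physical layout can be sketched with two independent tables. This is a toy model under my own assumptions, not a description of how Isilon places data: the `ToyScaleOut` class and its modulo placement rule are stand-ins chosen to keep the example short.

```python
class ToyScaleOut:
    """Toy model: the namespace and the data placement are separate tables."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.namespace = {}   # logical layout: path -> file id
        self.placement = {}   # physical layout: file id -> node name
        self._next_id = 0

    def _pick_node(self, file_id):
        # Placeholder placement rule (an assumption, not a real algorithm).
        return self.nodes[file_id % len(self.nodes)]

    def create(self, path):
        file_id = self._next_id
        self._next_id += 1
        self.namespace[path] = file_id
        self.placement[file_id] = self._pick_node(file_id)
        return file_id

    def rename(self, old, new):
        # Touches only the namespace; no data moves.
        self.namespace[new] = self.namespace.pop(old)

    def add_node(self, node):
        # Touches only the placement; every path stays where users expect it.
        self.nodes.append(node)
        for file_id in self.placement:
            self.placement[file_id] = self._pick_node(file_id)


fs = ToyScaleOut(["node1", "node2"])
fs.create("/teams/a/report.dat")
fs.rename("/teams/a/report.dat", "/teams/b/report.dat")
fs.add_node("node3")
print(sorted(fs.namespace))
```

Reorganizing directories and growing the cluster each change one table and leave the other alone, which is the property the aggregated-namespace approach lacks.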

There are four main requirements to provide the transparency of scale-out:

  • A single flat namespace, available from and serviced by all protocol heads. This removes the scaling limitation of a single namespace node, by allowing the capacity for namespace work to scale with the number of nodes in the system.
  • Flat allocation of storage space across the namespace. While the data may ultimately be stored in different cost/performance tiers, these should not be tied to artificial boundaries in the namespace.
  • The ability to add space to the global pool by adding new storage nodes to the existing system. Hiding this from the way applications access the system greatly simplifies future capacity planning.
  • Fail-in-place, the ability of the system to continue operating if drives or nodes fail or are removed from the system. This removal will necessarily reduce the available storage capacity, but should not prevent the system from continuing to function.

All of these capabilities are necessary to take full advantage of the power of a scale-out filesystem. Future posts will discuss some of the benefits and challenges this kind of scalable system brings. In my next post, we’ll see how the last two elements in the list help to enhance the long-term upgradability of the system.
