
Migrating Apache Cassandra to AWS Keyspaces

Apache Cassandra is an open-source, NoSQL database with a distributed architecture to maximize availability and reliability. Due to the relative ease with which clusters may be scaled, Cassandra is a popular choice for use cases involving heavy writes with simple, pre-defined querying needs. Despite this ease of use, the effort involved in managing a Cassandra cluster is non-negligible, especially as cluster size and complexity grow.

Organizations that manage Cassandra clusters independently may face high operational costs, manual patching, and continuous maintenance efforts. Some organizations may wish to focus more on strategic objectives, such as delivering business value and providing better user experiences, derived from their NoSQL data. For those organizations, migrating to a fully managed, Cassandra-compatible database service such as AWS Keyspaces can help achieve these goals.

Migration Overview

Overall, the pattern of Cassandra data migration is straightforward. In broad strokes, the steps are as follows:

  1. Data Offload
    The data residing on the Cassandra cluster must be extracted and saved on a file system or object store. There are several different tools with which this may be accomplished, which I will cover later in this article. The best migration tool will depend on data volumes and target timelines, and the chosen tool will be used both to offload data from the existing Cassandra instance and upload it to AWS Keyspaces.
  2. Randomization and Analysis
    Randomization
    When offloading data from Cassandra to a file, rows are typically written in key-sorted order. Behind the scenes, data stored in Keyspaces is partitioned. To optimize the subsequent data load, inserted data should be spread evenly across partitions. Randomizing the order of rows in the output files helps ensure that individual Keyspaces partitions aren’t placed under disproportionate load while other partitions sit unused.

    Analysis
    Although AWS Keyspaces is very similar to a standard Cassandra implementation, there are some minor differences. Generally, these differences do not have a major impact on usage; however, during migration, one key difference can disrupt the smooth upload of data to a target table. AWS Keyspaces tables have a maximum row size of one megabyte, so ensuring that individual rows in offloaded data files do not exceed this limit will prevent the headache of dealing with failed inserts. If any rows do exceed the limit, additional steps, such as compression or even splitting rows into separate tables, may be necessary. Further analysis of the data, such as assessing total row counts and file sizes, will help in subsequent steps of the migration.

  3. Target Keyspace & Table Creation
    Using the DDL of existing tables and keyspaces, counterpart artifacts should be created in AWS.
  4. Data Load
    Before beginning the load process, some configuration work must take place to optimize the costs and time of the table population. AWS Keyspaces provides two billing options for serverless management: on-demand and provisioned capacity modes. While on-demand capacity mode may suffice for typical day-to-day operations, a full migration is far from a normal use case. While it’s certainly possible to migrate using only on-demand capacity, those who try it may encounter excessive table load times. Using provisioned capacity and relevant metadata gathered from the earlier analysis step will aid in a time and cost-optimized migration. Once the appropriate throughput capacity has been set, the previously chosen migration tool can begin the loading process.
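The randomization and analysis steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the file paths are placeholders, and summing the UTF-8 byte lengths of the columns is a rough approximation of row size (it ignores per-cell encoding overhead), not an official Keyspaces calculation.

```python
import csv
import random

MAX_ROW_BYTES = 1_000_000  # rough stand-in for Keyspaces' ~1 MB row size limit


def shuffle_and_check(in_path, out_path, max_bytes=MAX_ROW_BYTES):
    """Shuffle offloaded CSV rows and flag any that may exceed the row-size limit."""
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    random.shuffle(rows)  # spread inserts evenly across Keyspaces partitions

    # Approximate each row's size as the sum of its columns' UTF-8 byte lengths.
    oversized = [
        r for r in rows
        if sum(len(col.encode("utf-8")) for col in r) > max_bytes
    ]

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    # Return the total row count (useful for capacity planning) and the
    # oversized rows, which need compression or restructuring before load.
    return len(rows), oversized
```

The returned row count feeds directly into the provisioned-capacity estimate in the data load step.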

Migration Tools

CQLSH

CQLSH is packaged and included with Cassandra, and functions as the default command-line interface for interacting with the database. It’s implemented in Python, with Cassandra versions up to 3.x requiring Python 2.7 while versions 4 and on support Python 3. The `COPY TO` and `COPY FROM` commands offload and load data from and to Cassandra tables, and this functionality alone could theoretically suffice to carry out a migration. In practice, however, CQLSH is slower than the alternative tools available, and may not be appropriate for moderate to large-scale migration efforts. Furthermore, it does not provide other migration tooling, such as file analysis, randomization, or job orchestration; that functionality must be developed separately. However, its ease of use and out-of-the-box availability make it well suited to simple, small-scale migrations or initial prototyping/proof-of-concept projects.
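As a sketch of how CQLSH might be scripted for a migration, the helpers below assemble `COPY TO` and `COPY FROM` invocations. The host, keyspace, table, and file names are hypothetical; the commented-out `subprocess.run` call shows how the commands would be executed against a live cluster.

```python
import subprocess  # used by the commented-out live-cluster invocation below


def cqlsh_copy_to(host, keyspace, table, out_file):
    """Build a cqlsh invocation that offloads a table to CSV via COPY TO."""
    stmt = f"COPY {keyspace}.{table} TO '{out_file}' WITH HEADER = TRUE;"
    return ["cqlsh", host, "-e", stmt]


def cqlsh_copy_from(host, keyspace, table, in_file):
    """Build a cqlsh invocation that loads a CSV into a table via COPY FROM."""
    stmt = f"COPY {keyspace}.{table} FROM '{in_file}' WITH HEADER = TRUE;"
    return ["cqlsh", host, "-e", stmt]


# Against a live cluster (names are placeholders):
# subprocess.run(cqlsh_copy_to("127.0.0.1", "app", "events", "/tmp/events.csv"), check=True)
```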

DataStax Bulk Loader

Developed by DataStax, DSBulk is an open-source tool for loading and offloading data in Cassandra and Cassandra-compatible databases. Compared to the CQLSH `Copy` command, DSBulk offers anywhere from 2–4 times faster data loads and offloads. It also offers several quality-of-life improvements such as support for multiple file types, customizable date formatting, progress reporting, and resumption of failed processes via checkpoint files. DSBulk can be installed directly on the Cassandra cluster or deployed on a separate server. While DSBulk performs better than CQLSH in speed and offers some built-in tooling, it still lacks some necessary migration functionality, mainly file randomization and job orchestration. This again necessitates the use of additional tooling or the development of custom solutions. DSBulk is overkill for small, simple migrations but lacks out-of-the-box functionality needed to carry out large, complex migrations. As such, it is best suited for migrations of moderate size and complexity.
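A DSBulk-based migration is typically driven by a pair of `unload` and `load` commands. The helper below builds such commands; the keyspace, table, and directory names are placeholders, and only the basic, well-documented flags (`-k`, `-t`, `-url`, `-header`) are used here.

```python
def dsbulk_cmd(op, keyspace, table, url, extra=None):
    """Build a DSBulk command; op must be 'load' or 'unload'."""
    if op not in ("load", "unload"):
        raise ValueError("op must be 'load' or 'unload'")
    cmd = ["dsbulk", op, "-k", keyspace, "-t", table, "-url", url, "-header", "true"]
    # `extra` can carry connection settings, e.g. contact points or credentials.
    return cmd + (extra or [])


# Example with placeholder names: offload a table to a directory of CSVs,
# then load that directory into the Keyspaces-side table.
# subprocess.run(dsbulk_cmd("unload", "app", "events", "/data/events"), check=True)
# subprocess.run(dsbulk_cmd("load", "app", "events", "/data/events"), check=True)
```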

Spark

The Spark-Cassandra Connector, also developed by DataStax, makes Spark a viable tool for a Cassandra to Keyspaces migration. This connector exposes Cassandra tables as RDDs, which allows Spark to offload those tables to files as well as carry out in-memory data analysis and randomization. Organizations with underutilized, self-hosted, or self-managed Spark clusters can easily repurpose those resources to aid their migration efforts. However, it is unlikely that many groups will find themselves with substantial Spark clusters just ‘lying around’, and the benefits of standing up a self-managed or self-hosted Spark cluster for a one-time migration will most likely not be worth the time or resource investment. A migration to an AWS service does, however, present the option of using one of AWS’s managed Spark services, such as Glue or EMR.
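A PySpark offload using the connector might look like the sketch below. This assumes the spark-cassandra-connector package is on the classpath; the contact point, keyspace, table, and bucket names are all placeholders, and it is not runnable without a Spark environment.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

# Placeholder host; in practice this is your cluster's contact point.
spark = (
    SparkSession.builder
    .appName("cassandra-offload")
    .config("spark.cassandra.connection.host", "cassandra.internal")
    .getOrCreate()
)

# Read the source table through the connector (placeholder keyspace/table).
df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="app", table="events")
    .load()
)

# Randomize row order so the subsequent Keyspaces load spreads evenly across
# partitions, and capture the row count for the analysis step.
shuffled = df.orderBy(rand())
print(f"rows to migrate: {shuffled.count()}")

# Write the randomized offload to object storage (placeholder bucket).
shuffled.write.mode("overwrite").option("header", "true").csv("s3://migration-bucket/events/")
```

One design note: doing the randomization in Spark rather than as a separate post-processing pass is precisely the single-tool advantage discussed below for Glue and EMR.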

I’ve previously covered the pros and cons of each service for data lake projects, and many of these same considerations hold true in the context of a Cassandra to Keyspaces migration. In short, the determining factor for which tool to use comes down to what objectives are being prioritized: the ability to get up and running quickly with pre-built quality-of-life features of Glue or the ability to optimize speed and cost via fine-tuned cluster configuration in EMR. Although a fine-tuned EMR cluster can be more cost-effective than Glue, both services are more resource-intensive than CQLSH or DSBulk. Thus, Spark in general, and Glue and EMR specifically, are best suited for large, complex migrations. In such migrations, the additional cost overhead can be justified by the benefit of using a single tool for randomization, analysis, and data off/loading. Glue and EMR in particular can easily be slotted into an existing AWS cloud data pipeline deployment and orchestration framework.

Additional Considerations

Although Cassandra to AWS Keyspaces migrations are conceptually simple, in practice there are additional factors that can shape how a migration unfolds. Offloading data from a Cassandra cluster can be resource-intensive and can impact application performance or even cluster stability. There are several ways to address this. The more cost-intensive option is to scale up, or even deploy a new cluster, to insulate existing application workloads from the impact of the migration. A more impactful but cheaper approach is to incur application downtime for the duration of the switchover. Although less than ideal for users of the application, this approach avoids incremental data loads: once the application has been taken down, no new data is written to the cluster, so nothing needs to be migrated later as part of a delta load. Every organization has unique needs, so there is no one-size-fits-all approach to this tradeoff. Generally speaking, though, Cassandra clusters that serve internal customers or applications are better candidates for downtime. Conversely, instances that face external customers or are part of revenue-generating applications may justify the additional overhead of temporarily scaling up cluster size or even a full new deployment.

Another option to consider is writing to the existing legacy Cassandra cluster and the new Keyspaces instance simultaneously. As this likely entails some refactoring of application code, it is easier said than done. However, it offers real benefits during a migration. Dual writes can be enabled before the full migration as a way to evaluate performance or verify the compatibility of AWS Keyspaces, and they can simulate production workloads to help size serverless resources with minimal customer impact. This approach has the added benefit of avoiding incremental data migration loads: there is a set point in time after which data on the new and legacy instances match, so only data that predates that point needs to be offloaded and loaded into the new table. While there are benefits to this dual-write strategy, the additional complexity involved in refactoring and deploying code on top of a migration effort may offset these benefits for some.
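A minimal dual-write wrapper might look like the following sketch. The two session objects are assumed to expose an `execute(stmt, params)` method, as the DataStax Python driver’s `Session` does; real implementations would add retry and reconciliation logic rather than simply surfacing the error.

```python
class DualWriter:
    """Mirror every application write to both the legacy cluster and Keyspaces.

    `legacy` and `keyspaces` are assumed session-like objects with an
    execute(stmt, params) method (e.g. cassandra-driver Sessions).
    """

    def __init__(self, legacy, keyspaces):
        self.legacy = legacy
        self.keyspaces = keyspaces

    def write(self, stmt, params):
        # Write to the legacy cluster first, since it remains the system of
        # record until cutover; a failed Keyspaces write raises so the
        # discrepancy can be logged and reconciled later.
        self.legacy.execute(stmt, params)
        self.keyspaces.execute(stmt, params)
```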

Conclusion

Migrating from a self-managed Cassandra cluster to AWS Keyspaces is not necessarily the right decision for all organizations. However, those who wish to divert limited time and resources from hands-on cluster management and maintenance towards more strategic efforts will find that Keyspaces has much to offer. While the individual steps of such a migration are straightforward, choosing the right tool for the effort can help an organization optimize between speed and cost. Accounting for the effect that the migration will have on production systems will aid in striking a balance between user impact and additional project overhead.