Migrating from Snowflake to BigQuery

Written by
Tommy Fu
Published on
May 20, 2024

Introduction: Why Choose BigQuery?

Our journey to adopt Google BigQuery began with a proof of concept (PoC) focused on which platform offered faster processing times and lower costs. BigQuery stood out, proving to be the more efficient and economical choice for handling our large-scale data workloads: both runtime and cost were roughly halved when comparing a medium warehouse in Snowflake with 200 slots in BigQuery. The results led our client, Majority, to switch their data warehousing operations from Snowflake to BigQuery.

Setting the Stage in Snowflake

The Project Scope

The project was carefully designed to focus solely on transferring the data pipelines while leaving the original AWS data sources untouched. Avoiding changes to the production code of applications was crucial to minimize disruption. Keeping the option open to revert to Snowflake provided a safety net, enabling us to test BigQuery’s capabilities without fully committing all resources from the start.

The data team at Majority was highly skilled in modern technical practices, especially Infrastructure as Code (IaC) and Continuous Integration/Continuous Deployment (CI/CD). These skills were invaluable: the existing infrastructure components for AWS and Snowflake were already well-documented in Terraform, which made it easier to streamline the changes when migrating to BigQuery.

Phase One: Automating Data Ingestion from AWS to BigQuery

Developing the Data Ingestion Pipeline

Our first step in the migration process was to set up an automated system for moving data from AWS S3 buckets to Google BigQuery. The code was written in Python, and we leveraged AWS Lambda to process files from S3 regardless of data format. One challenge we faced was BigQuery’s inability to automatically include file metadata during uploads. To solve this, we modified the Lambda function to add important metadata, such as ingestion time and filename, before the data was loaded into BigQuery.
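
For illustration, here is a minimal sketch of what such a Lambda handler could look like. It assumes newline-delimited JSON files and direct S3 notifications for simplicity (in practice events were fanned out through SNS, as described below); the project, dataset, and table names are placeholders, not our actual resources.

```python
import json
from datetime import datetime, timezone

import boto3
from google.cloud import bigquery

s3 = boto3.client("s3")
bq = bigquery.Client()

# Hypothetical destination; the real project, dataset, and table came from configuration.
TABLE_ID = "my-gcp-project.raw_layer.events"


def handler(event, context):
    """Process S3 object notifications: enrich rows with metadata, then load them into BigQuery."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line.strip()]

        # Add the file metadata that BigQuery does not capture on its own.
        ingested_at = datetime.now(timezone.utc).isoformat()
        for row in rows:
            row["_ingested_at"] = ingested_at
            row["_source_file"] = f"s3://{bucket}/{key}"

        # Run a load job and wait for it, so failures surface as Lambda errors
        # and the event can eventually land in the dead-letter queue.
        bq.load_table_from_json(rows, TABLE_ID).result()
```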

Overcoming Ingestion Challenges

Several technical hurdles needed to be addressed:

  • Creating a Versatile Ingestion Tool: We needed a tool as adaptable as Snowflake’s Snowpipe, but also capable of handling multiple directories within an S3 bucket. Our solution was to design a flexible schema that dynamically maps directories to the appropriate BigQuery tables (see the sketch after this list).
  • Handling Various File Formats: To manage different file formats, especially JSON, we decided to dump the raw payload into a JSON field in BigQuery and handle the rest downstream with dbt, rather than transforming the JSON into columns within the Lambda.
  • Maintaining Efficient Parallel Processing: We adopted an SNS fan-out strategy to keep our processing capabilities in line with those of Snowflake. The strategy allowed a single S3 bucket notification to efficiently feed multiple consumers, such as Snowpipe and our Lambda function. By taking advantage of the existing Terraform code, all we had to do was create an SNS topic and update the Snowpipe configuration.
  • Setting up Monitoring, Alerting and Error Handling: Implementing monitoring and error management was crucial, especially when using serverless Lambda as our ingestion tool. We established detailed logging and alarms within AWS to track issues during ingestion and alert us when they occur. Furthermore, we configured dead-letter queues for events that failed multiple times and leveraged AWS EventBridge Pipes to resend these messages once any code or configuration issues were resolved.
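
To illustrate the directory-to-table mapping from the first bullet, here is a minimal sketch. The prefixes and table names are hypothetical, and in our setup the actual routing was driven by configuration rather than hard-coded.

```python
# Hypothetical prefix-to-table routes; in practice the mapping was supplied through
# configuration (Terraform and environment variables) rather than hard-coded.
ROUTES = {
    "payments/": "my-gcp-project.raw_layer.payments",
    "users/": "my-gcp-project.raw_layer.users",
    "events/clickstream/": "my-gcp-project.raw_layer.clickstream",
}


def resolve_table(key: str) -> str:
    """Map an S3 object key to its BigQuery table by longest matching prefix."""
    matches = [prefix for prefix in ROUTES if key.startswith(prefix)]
    if not matches:
        raise ValueError(f"No BigQuery table configured for key: {key}")
    return ROUTES[max(matches, key=len)]


# Example: files under events/clickstream/ land in the clickstream table.
assert resolve_table("events/clickstream/2024/05/20/part-0001.json").endswith("clickstream")
```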

Phase Two: Backfilling Historical Data and Automating the Migration

Strategies for Backfilling Data

We used different approaches depending on the size and source of the data:

  • Using BigQuery Omni for Smaller Datasets: For datasets under 20GB, BigQuery Omni allowed direct access to S3 data, much like creating a “Stage” in Snowflake but from BigQuery.
  • Transforming Data into Parquet for Deprecated and Streaming Data Sources: For older and streaming data sources, we converted the data into Parquet files, which were then stored in Google Cloud Storage and later loaded into BigQuery (see the sketch after this list).
  • Direct S3 Loading for Large Datasets: For larger datasets, we chose to load the data directly from S3 to BigQuery, which allowed us to handle significantly bigger volumes effectively. The downside is that we could not include the ingestion time or the file names, but since this data is mostly used for incremental models in dbt, it was an acceptable trade-off for backfilling.
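
The Parquet route can be sketched roughly as follows, assuming pandas and pyarrow are available; the bucket and table names are placeholders, not our actual resources.

```python
import pandas as pd
from google.cloud import bigquery, storage

# Placeholder names; the real bucket and table were defined per data source.
GCS_BUCKET = "example-backfill-bucket"
TABLE_ID = "my-gcp-project.raw_layer.legacy_events"


def backfill_parquet(df: pd.DataFrame, object_name: str) -> None:
    """Write a DataFrame to Parquet, stage it in GCS, and append it to a BigQuery table."""
    local_path = f"/tmp/{object_name}"
    df.to_parquet(local_path, index=False)  # requires pyarrow (or fastparquet)

    # Stage the Parquet file in Google Cloud Storage.
    storage.Client().bucket(GCS_BUCKET).blob(object_name).upload_from_filename(local_path)

    # Load the staged file into BigQuery, appending to any existing rows.
    job = bigquery.Client().load_table_from_uri(
        f"gs://{GCS_BUCKET}/{object_name}",
        TABLE_ID,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        ),
    )
    job.result()
```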

To manage the migration smoothly, we wrote scripts to check the size of the data in Snowflake tables, which helped us decide whether to use BigQuery Omni or direct S3 loading. Adapting the table schemas from Snowflake to BigQuery’s formats was a pivotal part of this process.
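
A rough sketch of such a size-check script, assuming the snowflake-connector-python package and credentials supplied via environment variables; the 20 GB cutoff matches the BigQuery Omni threshold mentioned above, and the connection details are placeholders.

```python
import os

import snowflake.connector

# The 20 GB BigQuery Omni cutoff mentioned above.
OMNI_MAX_BYTES = 20 * 1024**3

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)


def backfill_strategy(database: str, schema: str) -> dict:
    """Return, per table, whether to backfill via BigQuery Omni or direct S3 loading."""
    cur = conn.cursor()
    cur.execute(
        f"""
        SELECT table_name, bytes
        FROM {database}.INFORMATION_SCHEMA.TABLES
        WHERE table_schema = %s AND table_type = 'BASE TABLE'
        """,
        (schema,),
    )
    return {
        name: ("bigquery_omni" if (size or 0) <= OMNI_MAX_BYTES else "direct_s3_load")
        for name, size in cur.fetchall()
    }
```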

Phase Three: Optimizing dbt Models and Queries

Reducing Costs with BigQuery

Moving from the on-demand model to the slot reservation model in BigQuery reduced our operational costs. Slot reservations are particularly beneficial for predictable, ongoing data workloads and offer significant savings: instead of paying for the amount of data each query scans, we pay for the compute time the queries consume.

During the SQL translation from Snowflake to BigQuery, we not only adapted the queries but also optimized them. These improvements cut down on computation times, enhancing the efficiency of our nightly data processing tasks.

Pros and Cons of Our Setup

Pros:

  • Scalability and Flexibility: Using serverless Lambdas allowed each event to be processed independently. Furthermore, because the configuration lives in Terraform, it is easy to add or remove data sources, which in turn simplified overall management.
  • Monitoring, Alerting and Error Handling: With the AWS stack we receive logs and alerts in Slack channels whenever an error occurs. In addition, by leveraging EventBridge Pipes, we could easily rerun failed events after identifying the issue and applying an appropriate fix.

Cons:

  • Cross-Cloud Data Management: Managing data across AWS and GCP added complexity and increased costs. While the Lambdas incurred some expenses, the majority of the costs came from cross-cloud data transfer fees, which were significantly higher than the computing costs.