
Sagemaker or Vertex AI?

Machine learning has been a strong selling point for cloud vendors in recent years, which is perhaps why both AWS and GCP have put significant effort into developing their ML platforms: Sagemaker and Vertex AI. Both platforms were launched only a few years ago, but they have been evolving and expanding ever since, adding so many features that it’s almost impossible to keep track of each one’s new releases.

So how can you choose between the two? In this blog post, I’ll walk you through the fundamental differences between GCP’s Vertex AI and AWS’s Sagemaker to help you make this strategic decision. Here are the points that, in my opinion, are crucial to consider when evaluating the platforms.

Don’t choose based on what’s currently available

As mentioned above, cloud vendors are continuously expanding the capabilities of their platforms. A potential user may read that one platform surpasses the other thanks to some killer feature, only to find out a few months later that the competitor has been quietly building its own version of that feature. These gaps get closed regularly. On top of this, as organizations mature their ML practices and roadmaps over time, a requirement that seemed crucial when choosing the platform may matter far less in the future. This is why I DON’T think a “grocery list” comparison of Sagemaker and Vertex AI features should drive the choice of platform, especially when the decision will affect your organization for the long haul.

Having said that, it’s worth looking at these two blog posts by Alex Chung and Vineet Jaiswal, who did everyone a great service by listing and comparing each platform’s features. They make it easy to see that both platforms are very capable, with many overlapping features that let you do just about everything on either one. And if a feature is missing today? Just wait a few months for the other platform to catch up.

Machine learning needs data

The first point where I see a difference between Vertex AI and Sagemaker is actually not officially part of either platform. Machine learning and AI require good data infrastructure, and on this criterion, Google Cloud seems to come out on top. Google’s data offering has more advanced tools that integrate well with Vertex AI, headlined by BigQuery, one of the leading data warehouses out there. The AWS customers I meet, on the other hand, often seek data solutions outside the native AWS ecosystem, like Databricks or Snowflake. This difference is most apparent in tabular data use cases and less so with unstructured data like images, videos, or audio.

Therefore, for organizations and individuals that need an out-of-the-box data platform, GCP has an edge.
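
To make this concrete, here’s a minimal sketch of that integration: pulling training data straight from BigQuery into pandas with the google-cloud-bigquery library. The project and table names are hypothetical placeholders, and it assumes default application credentials.

```python
# Minimal sketch: querying BigQuery directly into a pandas DataFrame.
# Project and table names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumes default credentials

query = """
    SELECT feature_a, feature_b, label
    FROM `my-project.demo.training_data`
    WHERE partition_date >= '2022-01-01'
"""

# One call returns a DataFrame, ready for local training or a Vertex AI job.
df = client.query(query).to_dataframe()
print(df.shape)
```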

Serving models in production

Practicing ML can take many shapes and forms across organizations. While some focus more on research or training models, others build production-grade applications with ML integrated into them. Even though model hosting is only a small part of the entire ML pipeline, Sagemaker’s hosting solution gives an advantage to organizations that need to manage their models in production. Hosting was one of the platform’s early features and seems to have been well planned from the beginning. Sagemaker implements DevOps best practices such as canary rollouts, integration with the centralized monitoring system (CloudWatch), deployment configurations, and more. Sagemaker also offers cost-efficient hosting options such as Elastic Inference, Serverless Inference, and multi-model endpoints.
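
To illustrate, here’s a hedged sketch of what a canary rollout looks like through boto3; the endpoint, config, and alarm names are hypothetical placeholders.

```python
# Sketch: a canary rollout of a new endpoint config on Sagemaker via boto3.
# Endpoint, config, and alarm names are hypothetical placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="churn-model-prod",
    EndpointConfigName="churn-model-config-v2",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                # Shift 10% of traffic first, wait, then shift the rest.
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 300,
            },
            "TerminationWaitInSeconds": 300,
        },
        # Roll back automatically if a CloudWatch alarm fires.
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "churn-model-5xx-errors"}]
        },
    },
)
```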

GCP, on the other hand, does a less successful job here. The separation between a deployment and its deployment configuration is not as clear, and the native support for different algorithms is narrower.
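
For contrast, here’s a minimal Vertex AI deployment sketch (the model resource name is a placeholder). Note how the machine spec and traffic split are arguments of the deploy call itself rather than a separate, reusable configuration object like Sagemaker’s endpoint config.

```python
# Sketch: deploying a registered model on Vertex AI. The machine spec and
# traffic split are arguments of the deploy call itself, not a separate
# deployment configuration. The model resource name is a placeholder.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("projects/my-project/locations/us-central1/models/123")
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=2,
    traffic_percentage=100,
)
```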

Ease of use

I acknowledge that this paragraph will probably be the most controversial part of my comparison, since user experience is hard to measure, especially between two different platforms whose parts don’t even fully overlap. However, here I found GCP to have a clear advantage over AWS.

Despite all the advances in machine learning tools over the past few years, ML is still hard to implement, especially at production-grade quality. Add to this a shortage of ML and MLOps engineering talent, and a platform’s user experience becomes an amplifier of the team’s productivity. Let me offer some examples:

  1. GCP resource views are global. It doesn’t matter which region a Notebook instance is running in; Vertex AI has a single page showing all the Workbench (Jupyter Notebook) servers. I can only imagine how uncomfortable it is to discover that you accidentally launched an expensive GPU server in a godforsaken region and didn’t shut it down for days (see the sketch after this list).
  2. Sagemaker notebooks are not accessible via SSH. As strange as it may sound, notebook instances cannot be SSHed into remotely, which rules out comfortable workflows like remote code execution from an IDE.
  3. Sagemaker notebook instance types and VPCs cannot be changed after launch. Worked on a notebook and now need to resize it? Impossible; you have to copy the data to a new notebook with a new configuration. Additionally, when a notebook shuts down, the libraries installed during the session are lost and need to be reinstalled.
  4. Sagemaker Studio – a fork of JupyterLab that AWS extended into a full ML dashboard – has, in my opinion, become too complicated. The Studio is supposed to be a management tool; however, you basically need to launch it because it runs on a server, instead of it being serverless or simply part of the AWS console.
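
To illustrate the first point, here’s a rough sketch of what finding a forgotten Sagemaker notebook actually takes: sweeping every region one by one. (Regions not enabled for your account may raise auth errors; this is a sketch, not hardened code.)

```python
# Sketch: because Sagemaker resources are region-scoped, finding a forgotten
# notebook means sweeping every region. On Vertex AI this is one global view.
import boto3

for region in boto3.session.Session().get_available_regions("sagemaker"):
    sm = boto3.client("sagemaker", region_name=region)
    for nb in sm.list_notebook_instances()["NotebookInstances"]:
        print(region, nb["NotebookInstanceName"], nb["NotebookInstanceStatus"])
```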

I have more examples, but you probably get the message – GCP is more intuitive to use, in my opinion, saving your developers expensive time.

Pricing

Price is, of course, one of the most important considerations when choosing your ML platform, but comparing it can be very tricky. Why? Because it’s an apples-to-oranges comparison. Apart from the machine itself, each platform offers different cost-optimization products (Savings Plans on AWS, Committed Use Discounts on GCP), and performance can vary based on factors unrelated to the machine itself, like networking. Nevertheless, I sampled 4 typical use cases:

  1. Notebook server with a low-spec machine
  2. Training with a balanced CPU machine
  3. Inference with a high-spec CPU machine
  4. Notebook with a V100 GPU

All pricing refers to the main US regions: Ohio (us-east-2) for AWS and us-central1 for GCP.

The results were a bit surprising:

Notebook – low spec instance (4 vCPU, 16 GB RAM)

| GCP Product       | GCP Pricing / Hour | AWS Product  | AWS Pricing / Hour |
|-------------------|--------------------|--------------|--------------------|
| e2-standard-4     | $0.13              | ml.t3.xlarge | $0.20              |
| ML management fee | $0.20              |              |                    |
| Total             | $0.33              | Total        | $0.20              |

Training – Balanced spec (16 vCPU, 64 GB RAM)

| GCP Product         | GCP Pricing / Hour | AWS Product   | AWS Pricing / Hour |
|---------------------|--------------------|---------------|--------------------|
| n2-standard-16      | $0.89              | ml.m5.4xlarge | $0.92              |
| ML Units multiplier | $0.18              |               |                    |
| Total               | $1.07              | Total         | $0.92              |

Inference (8 vCPU, 32 GB RAM)

| GCP Product   | GCP Pricing / Hour | AWS Product   | AWS Pricing / Hour |
|---------------|--------------------|---------------|--------------------|
| n1-standard-8 | $0.44              | ml.m5.2xlarge | $0.46              |
| Total         | $0.44              | Total         | $0.46              |

Notebook – 1 GPU spec instance (8 vCPU, 64 GB RAM, 1x V100)

| GCP Product       | GCP Pricing / Hour | AWS Product   | AWS Pricing / Hour |
|-------------------|--------------------|---------------|--------------------|
| n2-highmem-8      | $0.52              | ml.p3.2xlarge | $3.83              |
| ML management fee | $0.40              |               |                    |
| GPU (V100)        | $2.48              |               |                    |
| Total             | $3.40              | Total         | $3.83              |

There were major differences between GCP and AWS, not only in the prices themselves but also in the way they are structured. AWS seems to offer better low-spec pricing with its t3 burst instances, which can be good for running notebooks for experiments. GCP offered a lower cost for inference and GPU instances. However, GCP’s pricing was very complicated to understand.
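
To make the structural difference concrete, here’s the arithmetic behind the GPU notebook row above: GCP assembles the hourly price from separate line items, while AWS quotes a single all-inclusive “ml.” rate.

```python
# Worked check of the GPU notebook row: GCP's hourly price is assembled from
# separate line items, while AWS quotes one all-inclusive "ml." rate.
gcp_total = 0.52 + 0.40 + 2.48  # n2-highmem-8 + ML management fee + V100 GPU
aws_total = 3.83                # ml.p3.2xlarge, all-inclusive
print(f"GCP: ${gcp_total:.2f}/h vs AWS: ${aws_total:.2f}/h")
# GCP: $3.40/h vs AWS: $3.83/h
```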

Pricing structure

AWS simplified ML service pricing by creating machine types with an “ml” prefix. AWS users know that a p3.2xlarge, for example, is an 8 vCPU, 61 GB RAM machine with a Tesla V100; with the “ml” prefix, they can find the cost of using the equivalent machine through the Sagemaker system. GCP, on the other hand, complicated the pricing. For example, model training pricing is based on Consumed ML Units – beyond a note saying that this number is based on the hardware used and the time it was used, I couldn’t find a clear explanation of how many Consumed ML Units one hour of training translates to in every case. Additionally, I had to scan many pricing pages to work out the cost of a Jupyter notebook server.

Different hardware offerings

One thing that GCP has done better is the ability to create custom GPU/CPU combinations. With GCP, it’s possible to change the GPU-to-CPU ratio, so if I want a machine with more CPUs, it doesn’t mean I have to pay for more GPUs as well.
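
As a sketch, here’s what that looks like on Vertex AI, reusing the `model` object from the earlier deployment sketch: the CPU machine and the accelerator are chosen independently.

```python
# Sketch: on Vertex AI the CPU machine and the GPU are picked independently,
# so a CPU-heavy workload doesn't force you to pay for extra GPUs.
endpoint = model.deploy(  # `model` as in the earlier Vertex AI sketch
    machine_type="n1-standard-16",       # 16 vCPUs...
    accelerator_type="NVIDIA_TESLA_T4",  # ...but still just one GPU
    accelerator_count=1,
)
```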

Additionally, GCP and AWS each have unique hardware offerings. GCP offers its Google-designed TPUs – built specifically for deep learning tasks and compatible mostly with TensorFlow – which can drive down the cost of high-throughput computations. AWS offers Elastic Inference, which gives users access to a fraction of a GPU, letting them pay less for sparse usage.
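
For illustration, attaching an Elastic Inference accelerator in the Sagemaker Python SDK is a single extra argument on deploy. Here `model` is assumed to be a sagemaker Model object you’ve already created, and the instance and accelerator sizes are just examples.

```python
# Sketch: attaching a fractional-GPU Elastic Inference accelerator to a
# Sagemaker endpoint (sizes and the `model` object are placeholders).
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",        # a plain CPU instance...
    accelerator_type="ml.eia2.medium",  # ...plus a slice of GPU capacity
)
```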

Best of breed or best of suite?

While both platforms offer a vast set of tools as part of their suite, just like any other cloud platform, the suite is never complete. In many cases, cloud users need 3rd-party tools to round out the native cloud stack: PagerDuty for alerting, DataDog for monitoring, or Snowflake for data warehousing.

This holds true for ML monitoring as well and is largely influenced by the maturity of the organization’s ML and MLOps activities. As organizations mature and scale their use of ML, simplified ML monitoring may no longer be sufficient to ensure the health of their models in production. The domain of ML monitoring is usually split into two parts: data monitoring and model monitoring. And while both are crucial for the health of deployed ML applications, the offerings of both cloud platforms will often send you to a 3rd-party tool like Superwise.

Model monitoring 

AWS and GCP both take an approach that is closer to endpoint monitoring than to ML model monitoring. Each platform essentially offers a way to create a monitoring job that tracks one of the ML endpoints deployed to it. However, monitoring an ML model’s performance doesn’t require coupling the monitoring job to an endpoint; all Sagemaker and Vertex AI really need is a service that runs over ground truth, predictions, and timestamps. Sagemaker does offer this, but without clear enough documentation on how to monitor models deployed outside its ecosystem. GCP lags even further behind. AWS does, however, make it quite easy to configure an endpoint to save its calls to S3, letting you integrate external ML monitoring tools with minor effort.
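
Here’s a minimal sketch of that S3 capture setup with the Sagemaker Python SDK; the bucket and the `model` object are hypothetical placeholders.

```python
# Sketch: capturing an endpoint's requests and responses to S3, which is the
# hook that lets an external monitoring tool see production traffic.
from sagemaker.model_monitor import DataCaptureConfig

capture = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/endpoint-capture/",
)

predictor = model.deploy(  # `model` is a sagemaker Model created earlier
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    data_capture_config=capture,
)
```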

Data Monitoring

GCP based its data monitoring solution on the popular open-source TFX library, allowing users to leverage the tool, including its UI, to evaluate data drift, feature skew, and more. AWS uses its own library, called Deequ, which is less popular but demonstrates similar capabilities. Neither platform seems to have a solid architecture for this solution yet, leaving the data monitoring capabilities limited. In my humble opinion, AWS wins a few more points here because its solution allows better integration with the monitoring processes.
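
To give a flavor of the TFX-style validation that GCP builds on, here’s a standalone sketch using the tensorflow_data_validation (TFDV) library; `train_df` and `serving_df` are assumed to be pandas DataFrames you already have.

```python
# Sketch: TFX-style data validation run standalone with TFDV, the library
# underlying GCP's data monitoring. `train_df` / `serving_df` are placeholders.
import tensorflow_data_validation as tfdv

# Learn a schema from the training data, then check serving data against it.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(serving_stats, schema)
tfdv.display_anomalies(anomalies)  # reports missing features, skew, etc.
```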

So, which platform should I choose?

First, remember that choosing an ML platform is also a question of choosing a cloud vendor, and there are many more points we didn’t cover here, like Kubernetes, support, cost credits, infrastructure stability, and more. Both platforms have many advantages, and both keep expanding their capabilities. The major differences that I found can be summarized as follows: GCP feels easier to use, while AWS seems more production-ready.

Second, while it’s worth following the advances in each platform, it doesn’t look like either of the two can cover all the requirements of an ML platform. The limited model monitoring offering shows that integration with 3rd-party services (as is done with many other, non-ML cloud use cases) will continue to be a requirement. So if you chose an ML platform and didn’t get everything you needed from it, keep in mind that you can always complement it with best-of-breed 3rd-party tools.
