Back in February, I had an idea for a series of posts on a startup’s minimum viable data stack. The motivation was to provide advice that’d help startups navigate the ever-expanding MAD (Machine Learning, AI, and Data) landscape, which as of 2024 includes over 2,000 logos. I was also aiming to deepen my knowledge through research and writing for the series.
Since then, I’ve decided to drop the idea because:
- Specific constraints are key. A startup’s decisions on hiring, infrastructure, and requirements constrain what’s minimally viable in their case.
- MAD is a moving target. Even when restricted to “minimum viable” data stack components, the landscape keeps shifting.
- Deep research requires real-world experimentation. I was planning to use every tool as part of my research, but that approach is time-consuming and would still only scratch the surface. I simply don’t have the capacity to use every tool to the level it’d be used by a full-time team (no single person does).
I’ll illustrate these issues through a question I advised on recently:
Should a startup use Snowflake, Databricks, or a collection of AWS products at the core of their Data & AI stack?
(1) Specific constraints are key
The question is constrained to three top-level options. This was informed by the company’s choice of AWS as their cloud platform. While Snowflake and Databricks can run on any of the three major cloud providers, throwing Google Cloud or Azure into the mix would be foolish.
Indeed, when the time comes to get serious about Data & AI, many infrastructure decisions have often already been made. These decisions introduce constraints, which is a Good Thing: they narrow down the search through the MAD landscape.
The question also implies that the company would benefit from a cloud lakehouse solution. This isn’t always the case, especially in a minimum viable setting! Sometimes, spreadsheets are good enough. In other cases, the increasingly popular DuckDB can serve analytical needs. Determining what’s minimally viable requires understanding the specific context.
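To give a sense of how little setup that can involve, here’s a minimal sketch of covering basic analytical needs with DuckDB. The file and column names are made up for illustration:

```python
# A minimal sketch of DuckDB serving basic analytical needs.
# The file name and columns below are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database; pass a file path to persist

# Query a local Parquet file directly, with no warehouse or cluster required
rows = con.execute("""
    SELECT event_type, COUNT(*) AS n
    FROM 'events.parquet'
    GROUP BY event_type
    ORDER BY n DESC
""").fetchall()

for event_type, n in rows:
    print(event_type, n)
```

If a single machine and local files can answer the business questions, a setup like this may be all the “platform” a startup needs for a while.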
Finally, the context also includes current staff capabilities and plans for hiring. Continuing with our example, setting up Databricks or cobbling together AWS products requires more data engineering expertise than getting started with Snowflake (at least in 2024). The preferred choice has to be informed by team expertise, which limits the usefulness of general advice posts.
(2) MAD is a moving target
One Redditor distilled the question of Snowflake versus Databricks in February 2023 to a choice between: "(1) a product made to be a cloud RDBMS-style SQL engine focused on ELT and data collaboration, which is now adding data engineering workloads as a bolt-on; or (2) a product made to be a datalake-based Spark engine for distributed computations and engineering/data science workloads, which is adding a SQL database as a bolt-on."
The collection of AWS offerings is also ever-evolving, but anyone who’s used AWS products deeply knows that they’re not all equal in quality. For example, I was wondering whether Amazon Redshift has caught up with Snowflake and Databricks, given that it now includes a serverless option. However, Redshift still doesn’t seem to be getting much love, judging both by personal discussions with experts and by Reddit comments like: “We use Redshift extensively, and I would take my chances and pick an unknown solution over Redshift at this point”.
Check again next year and things might change. General recommendations based on current capabilities will get old pretty quickly.
(3) Deep research requires real-world experimentation
Vendors tend to paint a rosy picture of the everyday reality of setting up, managing, and using the products they’re selling. For complex tools, going through a basic setup on a sample project isn’t enough to gain the deep understanding that comes from real-world deployments. For an illustration of the practical knowledge gained by making many such decisions, check out the post “(almost) every infrastructure decision I endorse or regret after 4 years running infrastructure at a startup”.
Further, when it comes to Data & AI platforms, there are multiple types of users with different needs. For example, how an analyst interacts with the platform differs from how a data engineer interacts with it. My original thought of trying to cover all bases in a post series was just too ambitious.
As a shortcut to gaining deep experience with all the tools (an impossible endeavour), I opt for a combination of desk research and consulting experts. Returning to the running example, this sort of research means that I can now speak more confidently about the trade-offs between Snowflake, Databricks, and AWS tools in 2024. The short answer to the question of which Data & AI platform to choose is a big IT DEPENDS (and it will keep changing).
Beyond the tools: People, Platform, and Processes
Tools cover only one aspect of making technology decisions – the Platform. People and Processes are no less important: with an identical Platform, some companies will succeed and others will fail.
In general, the best tools are those that your People already know, or are committed to using well. Given the richness of the MAD landscape and software tooling in general, making reversible decisions and choosing boring, well-supported components where possible is the way to go.
While I’ve abandoned the idea of posting about each component in the minimum viable data stack, I still advocate for the core concepts: Work back from business needs, set up the minimal Platform that makes sense in your case, and ensure that your People and Processes are making the most out of the chosen Platform. Then keep iterating.
Good luck out there!
Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.