Hello Everyone,

I’ll admit it: I’m sort of obsessed with the rivalry between Databricks and Snowflake. There’s nothing quite like it in tech. It’s early June, and Snowflake’s Data Cloud Summit just finished.

🧱 Databricks vs. Snowflake ❄️

Even how Databricks and Snowflake describe themselves has changed in the era of Generative AI, not just in recent years but in recent months! Databricks is a global data, analytics and artificial intelligence company founded by the original creators of Apache Spark.

The company provides a cloud-based platform to help enterprises build, scale, and govern data and AI, including generative AI and other machine learning models. Snowflake is an American cloud computing–based data cloud company based in Bozeman, Montana. It was founded in July 2012 and publicly launched in October 2014 after two years in stealth mode. I’d cautiously say both companies are worth around $43 billion but have incredible potential. Databricks was founded one year after Snowflake, in 2013. Ten years later, Databricks made its most important AI acquisition to date.

In June 2023, Databricks made a key decision in AI when it agreed to acquire MosaicML, a deal that closed that July. Snowflake had acquired Neeva just before that, in May. This would set both companies on a path to become more AI native, to attract customers and to embed AI increasingly into their businesses. Snowflake reported fiscal first-quarter revenue growth of 33% year over year. What’s clear is the AI Data Cloud is going to become a big deal.

You might recall that Nvidia is one of the investors in Databricks.

Both are trying very hard to position themselves as being AI native as they compete and as we wait for Databricks to go public (IPO).

Snowflake very nearly acquired Reka AI, a promising foundation model builder, in 2024. But the deal fell through at the last moment.

Meanwhile, on June 4th, 2024, Databricks announced that it is acquiring Tabular, a small startup that helps companies optimize data they store in the cloud with the Apache Iceberg format.

What’s interesting to me is that Snowflake and Confluent were also bidding on Tabular. Both Databricks and Snowflake seem willing to make important acquisitions as they scale. I believe the winner between them will become a very important technology company in the 2030s. Currently I’m leaning slightly toward Databricks. How they are evolving in open-source Generative AI and continuing to provide more B2B enterprise value is also part of this story.

I asked the super talented of the Strategy Deck to dive into Databricks for us. Alex is a consultant who can help with strategy, product, market research, market strategy and planning for growth. I highly recommend her work:

Hire Alex

Product strategy

Roadmap management

Market research and analysis

Cross-functional team management

As Databricks moves toward its IPO, I’ll be watching their AI integrations even more closely. My best guess is Databricks goes public in early 2026.

By , June, 2024.

The emergence of AI in the past year has increased the importance of unstructured data and of the tools needed to store, manage, transform and analyze it. Business data is the fuel for training and fine-tuning models for the entire spectrum of enterprise use cases, and it is also an essential part of AI-based applications, especially when it comes to retrieval-augmented generation (RAG) systems.

Therefore, unstructured data management and analytics companies, such as Databricks, have become even more important for the AI ecosystem and have been growing together with it. But AI is not the only growth driver. In the past few years, Databricks has excelled at building an ecosystem of partners and contributors around itself, along with the associated network effects and the advantages that derive from them.

Network effects – the benefits gained from new users joining a platform and the competitive advantages that derive from the network – are core elements of marketplace companies, but they are not exclusive to them. B2B SaaS platforms are also building communities and marketplaces within or alongside their products that grow and maintain their business ecosystem.

This post is a look at Databricks’ product strategy and the types of network effects they are building around data management tools – an approach that helped the still private company get to a reported US$43B valuation as of September 2023, with US$1.6B revenue in the fiscal year ending in January 2024.

In a market that is virtually unlimited, as more digital data of all kinds is being created, stored and shared between companies every day, building and using network effects as a competitive advantage is valuable in the short- and long-term.

PRODUCT PORTFOLIO

Databricks provides a comprehensive platform for the entire data lifecycle, from ingestion and processing to analytics and machine learning. The core infrastructure is made up of the lakehouse architecture, which can handle both unstructured and structured data, on top of which the company offers a series of data processing, analytics and visualization tools. Within the lakehouse, Delta Lake is the open source storage framework that provides the format for tables and operations. On top of the data is the Unity Catalog, the interface and permissions system to discover, describe, audit and govern data assets managed within Databricks.

Alongside the core offering is a collection of scaling and optimization frameworks, as well as integrators and connectors to cloud computing providers (e.g. Azure, AWS) and 3rd-party analytics tools.

They include:

Databricks SQL – for SQL queries and visualization

Delta Engine – an optimizer for queries

Delta Live Tables – an ETL framework

Delta Sharing – open protocol for sharing with 3rd-parties

And in the past year, Databricks has focused on growing its collection of machine learning products. Alongside MLflow, the company offers tools to customize, fine-tune and build AI applications, as well as foundation models.

The AI tool catalog includes:

Databricks Assistant – an AI-based coding helper, currently in Public Preview

Mosaic AI – collection of tools to build, deploy and monitor AI models

MLflow – ML lifecycle management

Databricks Workflows – orchestration for data, analytics and AI

Databricks Notebooks – IDE for data and AI projects

Foundation models – the DBRX and Dolly series

DBRX was published in March 2024 and is a general-purpose, mixture-of-experts model with 132B total parameters, of which 36B are active on any input. It was pre-trained on 12T tokens of text and code with a 32k context window, and it was designed to contain 16 experts and choose four of them for any given input. DBRX is available under the Databricks Open Model Acceptable Use Policy and it scores 68.9% on ARC-Challenge, 89% on HellaSwag, 73.7% on MMLU (5-shot), 66.9% on GSM8k and 70.1% on HumanEval, according to its technical report.
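To make the "16 experts, choose four" idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not DBRX's actual router: the toy experts, the random router scores and the choice to re-normalize weights over only the selected experts are all assumptions made for illustration. Only the counts (16 experts, 4 chosen) come from the description above.

```python
import math
import random

NUM_EXPERTS = 16  # from the text: the model contains 16 experts
TOP_K = 4         # from the text: four experts are chosen per input


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    which is one common variant of top-k gating.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_output(x, router_logits, experts, k=TOP_K):
    """Weighted sum of the selected experts' outputs for a scalar input x."""
    return sum(w * experts[i](x) for i, w in route(router_logits, k))


# Hypothetical toy experts: expert i simply multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

# Hypothetical router scores; a real router computes these from the input.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(logits)                 # 4 (index, weight) pairs
y = moe_output(2.0, logits, experts)   # only 4 of the 16 experts run
n_combos = math.comb(NUM_EXPERTS, TOP_K)  # distinct 4-expert subsets: 1820

print(chosen, y, n_combos)
```

The point of the design is that only the selected experts do any work for a given input, which is why the active parameter count (36B) is much smaller than the total (132B).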

Dolly 2.0, released in April 2023, is a 12B-parameter open source model based on Pythia from EleutherAI and fine-tuned for instruction following.

Databricks has two major sources of network effects: the open source development model and the data marketplace.

OPEN SOURCE NETWORK EFFECTS

Databricks has a long history of developing in the open and is a strong contributor to open source, from its technological foundations using Apache Spark to more recent products, such as MLflow and Delta Sharing.

Open source development creates many advantages for a company, including increased innovation and collaboration with experts across fields and geographies, and enhanced transparency and security in the products. It also creates a network of contributors that benefits from network effects, which in turn improve the technology and the products being built.

There are four main types of network effects:

Direct same side, where people performing similar activities interact with each other for their mutual benefit

Direct cross-side, where people who contribute in complementary ways to the network interact with each other

Indirect same side, where people who perform similar activities benefit passively from each other’s presence and contribution to the network

Indirect cross-side, where people on complementary sides of the network benefit passively from the presence and activities of people on another side

In open source, direct and same side network effects accrue when each new volunteer who writes or reviews code increases the value of the technology and of the community for the other contributors. This happens because each new member brings her expertise and technical knowledge. More diverse and advanced skills enable better learning experiences and collaboration opportunities for the other developers to write even more innovative, performant and secure code.

And since open source communities also include volunteers and contributors who translate, evangelize, govern and market the technology, these members also benefit from the network effects in similar ways to code contributors.

Another type of direct same side network effect enabled by open source is social, as members of the community create professional connections, which help them later in getting different jobs or collaborating outside of the project where they met.

The development of technical standards, such as for web technologies, is another benefit of direct and same side open source network effects, where developers from different companies come together to establish common syntax and performance standards, which in turn promotes technical transparency, interoperability and safety for the entire technology.

There are also direct cross-side network effects enabled by open source. Users of the technology or derived products have access to diverse support communities, where they can get their questions answered. And if they need a particular feature built or a bug fixed, they can reach out directly to the developers and advocate for their needs, thus adding to the functionality present in a product.

Other direct cross-side network effects accrue to companies that build products and services using open source technologies, in the form of reduced maintenance and development costs.

These, and more, are network effects and advantages that companies who contribute to open source, such as Databricks, benefit from, directly and indirectly.

DATA MARKETPLACE NETWORK EFFECTS

In April 2023, Databricks launched a marketplace for data consumers and providers to share data sets and assets, such as ML models, notebooks, applications and dashboards. Based on the open Delta Sharing protocol, the marketplace benefits from the network effects inherent in 2-sided marketplaces that arise from the interdependence of the supply and demand sides.

The most important one is the direct, cross-side effect of the value increase of the marketplace for users on one side as the number of users on the other side grows. The more data providers there are, the more valuable the market becomes for data consumers. And vice versa. Growth on one side fuels the other one, creating a virtuous circle that becomes a competitive advantage for the company managing the platform.
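The cross-side dynamic can be illustrated with a toy model. This is only a sketch under loud assumptions: the value of the marketplace is approximated by the number of possible provider-consumer pairings, and the `pull` parameter (how strongly one side's size attracts the other) is hypothetical. None of the numbers describe Databricks' actual marketplace.

```python
def potential_matches(providers, consumers):
    """Toy value proxy: every provider can, in principle, serve every consumer."""
    return providers * consumers


def simulate(providers, consumers, rounds, pull=0.1):
    """Each round, each side grows in proportion to the size of the OTHER side.

    `pull` is a hypothetical attraction coefficient, not a measured quantity.
    Returns the list of (providers, consumers) sizes, including the start.
    """
    history = [(providers, consumers)]
    for _ in range(rounds):
        # Compute both updates from the previous round's sizes.
        providers, consumers = (providers + pull * consumers,
                                consumers + pull * providers)
        history.append((providers, consumers))
    return history


# One extra provider adds one potential pairing per existing consumer.
gain_from_one_provider = potential_matches(11, 100) - potential_matches(10, 100)

# Growth on one side feeds the other: both sides rise every round.
history = simulate(providers=10, consumers=100, rounds=5)
values = [potential_matches(p, c) for p, c in history]

print(gain_from_one_provider)
print(history[-1], values[-1])
```

In this model, the value of one more participant depends on the size of the opposite side, which is the essence of a cross-side effect.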

The Databricks marketplace is aimed at three use cases: data monetization, data sharing with partners or suppliers, and sharing with internal lines of business. Its value proposition is that it offers an open solution to securely share data independent of cloud providers and computing platforms. Data stored in open source formats, such as Apache Parquet and Delta Lake, can be shared without replication or physical movement. Additionally, it offers Clean Rooms that enable safer, more private sharing of sensitive data.

OPEN SOURCE AND B2B NETWORK EFFECTS IN AI

Open source development has been an important driver for AI for a long time, as significant frameworks such as PyTorch and TensorFlow, as well as the Apache Spark and Delta technologies, have been built in the open. And the recent growth in ML models that are accessible, to various degrees of openness, through Hugging Face further augments the importance of community-based, collaborative development.

Databricks was founded on open source and continues to adopt and extend such technologies while participating in the networks enabled by this type of development and their network effects. At the same time, it is investing in the data marketplace, with its own virtuous circle of growth between the supply and demand sides.

The rapid growth of AI in the past year has been driven by tremendous technological innovation, and a considerable part of it has been in the open and has happened through communities of data scientists, engineers and practitioners, collaborating and exchanging ideas to benefit users of AI applications and the world at large.

Biography and Market Strategy Consultant

An accomplished Product Manager and Strategist, Alex is available to accelerate your product’s growth by developing comprehensive product strategy, overseeing end-to-end product development, and managing cross-functional teams to ensure successful product launches.

Additionally, Alex delivers expert market research and analysis services, providing deep insights into market trends, competitive landscapes, and customer preferences.
