Prescient are the entrepreneurs who predicted data would become the new oil, like Ali Ghodsi, Andy Konwinski, Ion Stoica, Matei Zaharia, Patrick Wendell, Reynold Xin, and Scott Shenker. They’re the cofounders of Databricks, a San Francisco-based company that provides a suite of enterprise-focused scalable data science and data engineering tools. Since 2013, the year Databricks opened for business, it’s had no trouble attracting customers. But this week, the company’s uninterrupted march toward market domination kicked into high gear.
Databricks this morning announced that it’s closed a $400 million series F fundraising round led by Andreessen Horowitz with participation from Microsoft, Alkeon Capital Management, BlackRock, Coatue Management, Dragoneer Investment Group, Geodesic, Green Bay Ventures, New Enterprise Associates, T. Rowe Price, and Tiger Global Management. The round values the startup at $6.2 billion, up from a post-money valuation of $2.75 billion in February (following a $250 million funding round), and it comes shortly after Databricks’ revenue run-rate reached $200 million (during Q3 2019) and after a year in which annual recurring revenue grew 2.5 times year over year.
Databricks also announced that it has hired as CFO Dave Conte, who served as Splunk’s CFO for eight years. He’ll lead all of the financial and operational functions for the company, reporting directly to CEO Ghodsi.
Databricks now counts among its customers brands like Hotels.com, Viacom, HP, Shell Energy, Showtime, Riot Games, Sanford Health, Expedia, Condé Nast, McGraw Hill, Zeiss, Cisco, NBCUniversal, Overstock, Nielsen, Dollar Shave Club, and more across the advertising, technology, energy, government, financial services, health care, gaming, life sciences, media, and retail segments. Ghodsi says that in total, data teams at more than 5,000 organizations are using its platform today, which amounts to more than double the number of organizations (2,000) Databricks reported in 2019.
“[We’re] the fastest-growing enterprise software cloud company on record. Our bets on massive data processing, machine learning, open source, and the shift to the cloud are all playing out in the market and resulting in enormous and rapidly growing global customer demand,” added Ghodsi, who said that Databricks will set aside a €100 million ($110 million) slice of the series F proceeds to expand its Amsterdam-based European development center over the next three years. (He claims the center has already grown by three times over the past two years.) Other near-term plans involve bolstering the company’s operations in Europe, the Middle East and Africa, Asia Pacific, and Latin America, as well as its workforce of 600 employees spread across major offices in Amsterdam, Singapore, and London.
Databricks was founded by the original creators of Apache Spark, an open source distributed general-purpose cluster-computing framework developed atop Scala at the University of California, Berkeley’s AMPLab, and it chiefly develops web-based tools for orchestrating deep learning, machine learning, and graph processing workloads. The company’s suite interfaces with Spark’s more than 100 operators for data transformation and manipulation, and it provides automated cluster management and virtual notebook environments for real-time collaborative programming.
Databricks’ data science workspaces provide environments for running analytic processes and managing machine learning models, supplemented by interactive notebooks that support multiple languages including R, Python, Scala, Java, and SQL, plus libraries and frameworks like Conda, XGBoost, Google’s TensorFlow, Keras, Horovod, Facebook’s PyTorch, and scikit-learn. Interactive point-and-click visualizations (and scriptable options like matplotlib, ggplot, and D3) come standard, as do orchestration tools enabling users to develop, deploy, and monitor models from a centralized repository to container services from Alteryx, Azure, Amazon, DataRobot, and Dataiku.
Those notebooks support coauthoring, commenting, and automated versioning, in addition to real-time alerts and audit logs for troubleshooting. They’re able to kick off machine learning pipelines automatically or pass along data to Tableau, Looker, Power BI, RStudio, Snowflake, and other platforms, and their results can be exported in popular formats like HTML and IPYNB (notebook document).
As for Databricks’ data analytics and unified data services products, they’re built on a Spark-compatible layer from the Linux Foundation — Delta Lake — that sits atop existing data lakes and uses Apache Parquet (a column-oriented data storage format) to store data while capturing snapshots and tracking commits. (Databricks says it can handle petabyte-scale tables with billions of partitions and files.) Analytics workspaces remain private across thousands of users and data sets and can be audited and analyzed by administrators, who also have the power to manage infrastructure and impose restrictions and organization-wide budgets.
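To make the idea of "capturing snapshots and tracking commits" concrete, here is a minimal toy sketch in plain Python. It is not Delta Lake's implementation or API. It only assumes the general pattern of an append-only commit log in which each commit records the data files it adds and removes, so that the table's state at any version can be rebuilt by replaying the log. The class name `ToyTableLog` and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One entry in the log: the files a write added and removed."""
    version: int
    added: frozenset
    removed: frozenset


class ToyTableLog:
    """Hypothetical append-only commit log (illustration only)."""

    def __init__(self):
        self._commits = []

    def commit(self, add=(), remove=()):
        """Append a commit and return its version number."""
        version = len(self._commits)
        self._commits.append(Commit(version, frozenset(add), frozenset(remove)))
        return version

    def snapshot(self, version=None):
        """Replay commits 0..version and return the set of live files."""
        if not self._commits:
            return set()
        if version is None:
            version = len(self._commits) - 1
        if not 0 <= version < len(self._commits):
            raise ValueError(f"unknown version: {version}")
        files = set()
        for entry in self._commits[: version + 1]:
            files -= entry.removed
            files |= entry.added
        return files


log = ToyTableLog()
v0 = log.commit(add=["part-000.parquet", "part-001.parquet"])
# A rewrite replaces one file with a new one; the old version stays readable.
v1 = log.commit(add=["part-002.parquet"], remove=["part-000.parquet"])

old_files = log.snapshot(v0)   # table as of the first commit
current_files = log.snapshot() # table as of the latest commit
```

Because the log is only ever appended to, reading an older snapshot is just a matter of stopping the replay early, which is the intuition behind versioned reads on top of immutable columnar files.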
Databricks also develops MLflow, an end-to-end open source platform for machine learning experimentation, validation, and deployment, and Koalas, a project that augments PySpark’s DataFrame API to make it compatible with pandas. One component of MLflow, MLflow Tracking, records and queries experiments, while two others, MLflow Projects and MLflow Models, provide a platform-agnostic packaging format for reproducible runs and a general format for sending models to deployment tools, respectively.
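As a rough illustration of what "recording and querying experiments" means in practice, the following plain-Python sketch logs parameters and metrics for a few runs and then asks which run scored best. It deliberately does not use MLflow's API. `ToyTracker`, its methods, and the learning-rate and accuracy numbers are all hypothetical, made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: its inputs (params) and outcomes (metrics)."""
    run_id: int
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class ToyTracker:
    """Hypothetical in-memory experiment tracker (illustration only)."""

    def __init__(self):
        self.runs = []

    def start_run(self, **params):
        """Record a new run with the given parameters."""
        run = Run(run_id=len(self.runs), params=dict(params))
        self.runs.append(run)
        return run

    def log_metric(self, run, name, value):
        """Attach a named metric value to a run."""
        run.metrics[name] = value

    def best_run(self, metric, higher_is_better=True):
        """Return the run with the best value for `metric`, or None."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


tracker = ToyTracker()
# Made-up results standing in for three training runs.
for learning_rate, accuracy in [(0.1, 0.81), (0.01, 0.88), (0.001, 0.84)]:
    run = tracker.start_run(learning_rate=learning_rate)
    tracker.log_metric(run, "accuracy", accuracy)

best = tracker.best_run("accuracy")
```

A real tracking service adds persistence, artifacts, and a query interface on top, but the core record is the same: each run pairs the settings that went in with the numbers that came out.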
Perhaps the crown jewel in Databricks’ product portfolio is the Databricks Runtime, a processing engine built on an optimized version of Apache Spark that runs on auto-scaling infrastructure. With it, users can create, restart, or terminate Spark clusters and reconfigure or reuse resources as needed. Additionally, the Runtime affords freedom over which version of Spark is used to run jobs, as well as over the job scheduler and notifications of production job starts, failures, and completions.
A subcomponent of the Databricks Runtime is the Machine Learning Runtime, which is generally available across Databricks’ product offerings and which provides scalable clusters that include popular frameworks, built-in AutoML, and performance optimizations. Features run the gamut from a collection of prebuilt containers, libraries, and frameworks (e.g., XGBoost, NumPy, MLeap, pandas, and GraphFrames) and model search using MLflow to a simple API (HorovodRunner) for distributed training. Databricks claims that the Machine Learning Runtime delivers up to a 40% speed-up compared to Apache Spark 2.4.0.
“No other company has successfully commercialized open source software like Databricks,” said Andreessen Horowitz cofounder and general partner Ben Horowitz. “We’ve all witnessed the strong evolution of Apache Spark as the standard for big data processing. Not surprisingly, we continue to see open source innovation from this team, with Delta Lake, MLflow and Koalas.”
A report by Market Research Future pegged the big data analytics market value at $275 billion by 2023, and Gartner recently predicted that AI-derived business revenue will reach $3.9 trillion in 2022. With that sort of capital at stake, it’s no wonder investors are throwing their weight around to the tune of hundreds of millions of dollars. Analytics service provider Fractal Analytics raised $200 million in January, months ahead of end-to-end data operations platform provider Unravel’s $35 million series C fundraising round. Not to be outdone, business analytics startup Sisense nabbed $80 million to expand its offerings last September.