Meet Lakeflow, the Superhero Your Data Pipelines Deserve
Let's be honest: working with modern data systems is hard and stressful.
You have data arriving from many different sources, jobs that need to run on strict schedules, and fragile scripts that barely hold together. In the middle of all this mess, you still need to build reliable data systems.
This is where Databricks Lakeflow comes in. It's a new offering, announced at the Databricks Data+AI Summit 2025, that makes data engineering significantly simpler.
Think of it as a set of tools that work together: easy connections to other systems, automation built in, a visual view of your data flows, data engineering best practices out of the box, and AI that helps you write your code, like a really smart autocomplete. You tell it what you want, not how to do it.
In this blog post, we'll explain what Lakeflow is, why it's important, how it's different from Delta Live Tables, and why your data team will probably love it.
What Exactly Is Lakeflow?
Imagine if your data pipelines could:
- Build themselves with AI suggestions,
- Flow seamlessly from ingestion to orchestration,
- Come with 20+ native connectors out of the box,
- And never make you stare at a broken DAG at 2 AM again.
Databricks Lakeflow is a one-stop shop that stitches together all your messy data engineering needs into a single, smooth, declarative experience.
Show Stopping Features of Lakeflow
1. Lakeflow Designer – Code Meets Canvas: Lakeflow Designer gives you a visual editor where you drag and drop to design pipelines, dual-mode editing that lets you switch between code and canvas, and Git integration and versioning baked right in. Think: VS Code meets Miro board meets Spark.
2. Lakeflow Connect – Plug Into Everything: Out-of-the-box managed connectors with CDC support for: Salesforce, Oracle, SQL Server, SharePoint, SAP, Kafka, MongoDB. And yes, even Snowflake and Redshift (the irony is not lost on us). Built-in Unity Catalog support means all ingested data is instantly governed, tracked, and secured.
3. Lakeflow Jobs – Orchestration with Brains: Your brittle Airflow scripts can retire now. Lakeflow Jobs offers branching, looping, and conditional logic out of the box; trigger-based execution (on files, tables, or time); serverless compute; and full observability with alerts. Bonus: it doesn't choke at scale.
4. Lakeflow Pipelines: Lakeflow Pipelines simplify the creation and management of batch and streaming pipelines by building on the declarative Delta Live Tables framework. They let you focus on business logic using SQL or Python, while Databricks handles orchestration, incremental processing, and autoscaling. With built-in data quality monitoring and a Real-Time Mode, Lakeflow enables low-latency data delivery—all without needing extra code.
5. Zerobus – Blink and Your Data is There: Lakeflow introduces Zerobus, a lightning-fast ingestion backbone: < 5-second latency, up to 100 MB/sec throughput, Streams data directly into Delta Lake with zero setup. It's the bullet train your streaming ETL dreams of.
6. AI-Assisted Authoring – Like Copilot, but for Pipelines: Don't remember the syntax for .option("checkpointLocation")? Lakeflow's got you: smart autocomplete, inline recommendations, and AI-generated pipeline code snippets from natural language prompts (see the ingestion sketch after this list for the kind of code it saves you from typing). Spend more time thinking about logic, not boilerplate.
7. Governed from Day Zero: No more data anarchy. With Unity Catalog and Lakehouse Monitoring, every dataset, transformation, and job is tracked, lineage is automatic, and access policies and audits are consistent across the board.
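To make the boilerplate point from feature 6 concrete, here is a minimal sketch of the kind of Structured Streaming ingestion code that AI-assisted authoring can draft from a natural-language prompt. It uses standard Auto Loader and Structured Streaming options rather than any Lakeflow-specific API, and the paths and table name are hypothetical placeholders.

# A minimal, hand-written ingestion sketch: the boilerplate Lakeflow's AI assist can generate for you.
# Paths and table names are hypothetical; assumes a Databricks-provided `spark` session.
from pyspark.sql.functions import current_timestamp

raw_events = (
    spark.readStream.format("cloudFiles")                        # Auto Loader file source
    .option("cloudFiles.format", "json")                         # format of incoming files
    .option("cloudFiles.schemaLocation", "/chk/events/schema")   # where the inferred schema is tracked
    .load("/data/raw/events/")
    .withColumn("ingested_at", current_timestamp())
)

(
    raw_events.writeStream.format("delta")
    .option("checkpointLocation", "/chk/events/checkpoint")      # the option nobody remembers
    .trigger(availableNow=True)                                  # process what's there, then stop
    .toTable("bronze.events")
)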
How Lakeflow Components Work Together: The Complete Picture
Understanding Lakeflow means seeing how its three core components (Lakeflow Connect, Lakeflow Pipelines, and Lakeflow Jobs) integrate into a unified data orchestration platform. Let's explore how these pieces fit together to power modern data architectures.
The Three Pillars of Lakeflow
Lakeflow Connect: The Data Gateway. Your bridge to the outside world, connecting to external systems, databases, APIs, and SaaS applications.
Lakeflow Pipelines: The Transformation Engine. Built on Delta Live Tables (DLT), this is where your declarative ETL/ELT magic happens.
Lakeflow Jobs: The Orchestrator. Manages workflow execution, scheduling, dependencies, and monitoring across your entire data platform. Here is an example data flow through the components:
External Systems → Lakeflow Connect → Lakeflow Pipelines → Data Products → Lakeflow Jobs
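As a rough sketch of how the orchestration layer sits on top of the other two pillars, the snippet below uses the Databricks SDK for Python to register a job that runs an existing pipeline on a daily schedule. The pipeline ID, job name, and email address are placeholders; the same job can also be defined through the Lakeflow Jobs UI or Databricks Asset Bundles.

# A minimal sketch (one of several ways): creating a job that runs a pipeline on a schedule
# using the Databricks SDK for Python. IDs, names, and emails below are hypothetical placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads workspace credentials from the environment

job = w.jobs.create(
    name="daily_customer_refresh",
    tasks=[
        jobs.Task(
            task_key="run_customer_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="<your-pipeline-id>"),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # every day at 06:00
        timezone_id="UTC",
    ),
    email_notifications=jobs.JobEmailNotifications(on_failure=["data-team@example.com"]),
)
print(f"Created job {job.job_id}")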
Lakeflow = DLT + Airflow + Connectors + AI
Lakeflow doesn't just unify the data engineering stack; it gives it a fresh coat of automation, visibility, and elegance. Whether you're wrangling messy source systems or orchestrating multi-hop data pipelines, Lakeflow's got your back with power, polish, and peace of mind.
Quick Primer: What is Delta Live Tables (DLT)?
Before Lakeflow was introduced, Delta Live Tables (DLT) was Databricks' main tool for making data pipelines easier to build and manage.
Understanding DLT in Simple Terms
Think of DLT as a smart system that helps you create reliable, production-quality data pipelines. You only need to write SQL or Python code; you don't have to worry about complex scheduling systems or about managing servers and infrastructure.
DLT simplified a lot of the complicated work data engineers typically had to do manually. Instead of writing hundreds of lines of code to move and transform data, you could describe what you wanted to happen, and DLT would figure out how to make it happen. To learn more about DLT, follow this link.
The Limitations of DLT
Even though DLT made life much easier for data engineers, it still had some important limitations:
- Manual data ingestion: You had to write your own code to bring data in from external sources like databases, APIs, or cloud storage systems.
- No built-in CDC support: CDC (Change Data Capture) is a technique for tracking changes in source databases. With DLT, you needed external tools to capture those changes (see the sketch after this list).
- Limited visual design: You mostly worked with code. There wasn't a drag-and-drop visual interface for designing pipelines.
- Basic orchestration: Coordinating multiple pipelines or complex workflows often required additional tools like Apache Airflow.
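To put the CDC limitation in context: DLT could apply a change feed with its apply_changes API, but capturing the changes from the source database was left to external tools. The sketch below assumes a hypothetical stream of change records (with customer_id, sequence_num, and operation columns) that an external CDC tool has already landed as files; DLT only handles the "apply" half.

# A sketch of the "apply" side of CDC in classic DLT. The change feed itself must already
# have been captured and landed by an external tool; columns and paths are illustrative.
import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def customer_changes():
    # Hypothetical change records landed as JSON files by an external CDC tool
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/cdc/customers/")
    )

dlt.create_streaming_table("customers_current")

dlt.apply_changes(
    target="customers_current",        # table kept up to date with the latest values
    source="customer_changes",         # the change feed defined above
    keys=["customer_id"],              # key used to match rows
    sequence_by=col("sequence_num"),   # ordering column to resolve out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=1,              # keep only the current row per key
)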
Declarative Pipelines in Lakeflow: Simplified Data Engineering
Lakeflow Pipelines are declarative: you define what your pipeline should produce using SQL or Python, and Databricks takes care of the rest. There is no need to manage orchestration logic, cluster configs, or retries. Lakeflow automates:
- Data flow dependencies
- Incremental updates
- Streaming vs. batch handling
- Compute autoscaling
This approach dramatically reduces boilerplate, improves reliability, and makes pipelines easier to build, monitor, and maintain.
Traditional Imperative Approach vs The Lakeflow Declarative Approach
Below is the same customer-and-orders pipeline written both ways.
Traditional Imperative Approach
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from delta.tables import DeltaTable
# STEP 1: Read Raw Data
raw_customers = spark.read.format("json").load("/data/raw/customers/")
# STEP 2: Data Cleaning
# Manual data quality checks
cleaned_customers = raw_customers.filter(
(col("customer_id").isNotNull()) &
(col("email").isNotNull()) &
(col("email").rlike("^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$"))
)
# Remove duplicates
cleaned_customers = cleaned_customers.dropDuplicates(["customer_id"])
# Add metadata
cleaned_customers = cleaned_customers.withColumn("processed_at", current_timestamp())
# Write to staging table
print("Writing cleaned data...")
cleaned_customers.write \
.format("delta") \
.mode("overwrite") \
.option("mergeSchema", "true") \
.saveAsTable("staging.cleaned_customers")
# STEP 3: Read Orders Data
raw_orders = spark.read.format("parquet").load("/data/raw/orders/")
# Clean orders
cleaned_orders = raw_orders.filter(
(col("order_id").isNotNull()) &
(col("customer_id").isNotNull()) &
(col("order_amount") > 0)
)
cleaned_orders.write \
.format("delta") \
.mode("overwrite") \
.saveAsTable("staging.cleaned_orders")
# STEP 4: Enrichment
# Need to re-read the tables we just wrote
customers_df = spark.read.table("staging.cleaned_customers")
orders_df = spark.read.table("staging.cleaned_orders")
# Join and aggregate
enriched_customers = customers_df.alias("c").join(
orders_df.alias("o"),
col("c.customer_id") == col("o.customer_id"),
"left"
).groupBy(
"c.customer_id",
"c.email",
"c.first_name",
"c.last_name"
).agg(
count("o.order_id").alias("total_orders"),
coalesce(sum("o.order_amount"), lit(0)).alias("lifetime_value"),
max("o.order_date").alias("last_order_date")
)
enriched_customers.write \
.format("delta") \
.mode("overwrite") \
.saveAsTable("analytics.enriched_customers")
# STEP 5: Customer Segments
# Re-read enriched data
enriched_df = spark.read.table("analytics.enriched_customers")
customer_segments = enriched_df.withColumn(
"segment",
when(col("lifetime_value") >= 1000, "VIP")
.when(col("lifetime_value") >= 500, "Premium")
.when(col("lifetime_value") >= 100, "Regular")
.otherwise("New")
)
customer_segments.write \
.format("delta") \
.mode("overwrite") \
.saveAsTable("analytics.customer_segments")
print("Pipeline completed successfully!")
# Manual error handling
try:
    # All of the above code needs to be wrapped in try/except
    pass
except Exception as e:
    print(f"Pipeline failed: {e}")
    # Manual retry logic needed
    # Manual alerting needed
Lakeflow Declarative Approach
import dlt
from pyspark.sql.functions import *
# BRONZE LAYER: Raw Data Ingestion
@dlt.table(
comment="Raw customer data ingested from JSON files",
table_properties={"quality": "bronze"}
)
def raw_customers():
return spark.read.format("json").load("/data/raw/customers/")
@dlt.table(
comment="Raw orders data ingested from Parquet files",
table_properties={"quality": "bronze"}
)
def raw_orders():
return spark.read.format("parquet").load("/data/raw/orders/")
# SILVER LAYER: Cleaned and Validated Data
@dlt.table(
comment="Cleaned and validated customer data with quality checks",
table_properties={"quality": "silver"}
)
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_email", "email IS NOT NULL AND email RLIKE '^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$'")
def cleaned_customers():
return (
dlt.read("raw_customers")
.dropDuplicates(["customer_id"])
.withColumn("processed_at", current_timestamp())
)
@dlt.table(
comment="Cleaned orders data with quality checks",
table_properties={"quality": "silver"}
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "order_amount > 0")
def cleaned_orders():
return dlt.read("raw_orders")
# GOLD LAYER: Enriched and Aggregated Data
@dlt.table(
comment="Enriched customer data with order metrics",
table_properties={"quality": "gold"}
)
def enriched_customers():
customers = dlt.read("cleaned_customers").alias("c")
orders = dlt.read("cleaned_orders").alias("o")
return customers.join(
orders,
col("c.customer_id") == col("o.customer_id"),
"left"
).groupBy(
"c.customer_id",
"c.email",
"c.first_name",
"c.last_name"
).agg(
count("o.order_id").alias("total_orders"),
coalesce(sum("o.order_amount"), lit(0)).alias("lifetime_value"),
max("o.order_date").alias("last_order_date")
)
@dlt.table(
comment="Customer segments based on lifetime value",
table_properties={"quality": "gold"}
)
@dlt.expect("has_segment", "segment IS NOT NULL")
def customer_segments():
return (
dlt.read("enriched_customers")
.withColumn(
"segment",
when(col("lifetime_value") >= 1000, "VIP")
.when(col("lifetime_value") >= 500, "Premium")
.when(col("lifetime_value") >= 100, "Regular")
.otherwise("New")
)
)
These examples clearly show how Lakeflow's declarative approach reduces complexity, improves maintainability, and provides better data quality guarantees.
Key Differences
| Aspect | Traditional | Declarative |
| --- | --- | --- |
| Code Lines | ~120 lines | ~70 lines |
| Dependency Management | Manual ordering required | Automatic based on dlt.read() |
| Data Quality | Manual filters | Declarative expectations (@dlt.expect); see the sketch below the table |
| Error Handling | Manual try-catch blocks | Built-in retry and recovery |
| State Management | Manual read/write cycles | Automatic optimization |
| Incremental Updates | Complex merge logic needed | Built-in with @dlt.table |
| Monitoring | Custom logging | Auto-generated metrics & lineage |
| Testing | Need a separate test framework | Built-in data quality assertions |
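To make the Data Quality row concrete, here is a small sketch of the three expectation modes that DLT-based pipelines support: warn (log and keep), drop, and fail. It reuses the cleaned_orders table from the example above; the rule names and thresholds are illustrative.

import dlt

# Three levels of enforcement for declarative data quality rules (rule names and thresholds are illustrative)
@dlt.table(comment="Orders with layered data quality rules")
@dlt.expect("reasonable_amount", "order_amount < 100000")        # warn: record violations, keep the rows
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")    # drop: silently remove offending rows
@dlt.expect_or_fail("has_customer", "customer_id IS NOT NULL")   # fail: abort the update on violation
def quality_checked_orders():
    return dlt.read("cleaned_orders")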
Open-Sourcing Declarative Pipelines for Apache Spark
Databricks has open-sourced the declarative pipeline framework that powers Lakeflow Pipelines, making it available to the broader Apache Spark community.
The Future of Data Pipeline Development
Databricks Lakeflow represents a fundamental shift in how we approach data engineering. By embracing declarative pipelines, data teams can move away from the complexity of imperative orchestration and focus on what truly matters: defining business logic and ensuring data quality.
Why Lakeflow Matters
In our traditional vs. declarative comparison, we saw how Lakeflow reduced our pipeline code by nearly 40% while simultaneously improving reliability, observability, and data quality. But the benefits extend far beyond mere line count:
- For Data Engineers: Spend less time debugging infrastructure issues and more time building valuable data products
- For Data Teams: Achieve faster time-to-market with self-documenting, maintainable pipelines
- For Organizations: Reduce operational costs through intelligent compute management and minimize data quality incidents
When Should You Use Lakeflow?
Lakeflow is ideal when you need:
- Scalable ETL/ELT pipelines with built-in quality checks
- Real-time streaming and batch processing in a unified framework
- Strong data governance and lineage tracking
- Reduced operational overhead and maintenance burden
- Medallion architecture (Bronze/Silver/Gold) implementation
Final Thoughts
As data volumes grow and business demands increase, the old ways of building data pipelines simply don't scale. Lakeflow isn't just about writing less code; it's about building more reliable, maintainable, and trustworthy data systems that empower your entire organization to make better decisions.
The declarative paradigm shift we're seeing with Lakeflow mirrors the evolution we've witnessed in other areas of technology, from infrastructure-as-code to containerization. The question isn't whether to adopt declarative data pipelines, but when and how to begin your journey.
Ready to modernize your data pipelines? Start exploring Databricks Lakeflow today and experience the difference that declarative development can make.