Ask AI

You are viewing an unreleased or outdated version of the documentation

Tutorial, part one: Intro to Dagster#

Dagster is an orchestrator that's designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.

In this tutorial, you'll analyze activity on the popular news aggregation website, Hacker News. You'll fetch data from the website, clean it up, and build a report that summarizes some findings. You'll then tell Dagster to occasionally update the data and the report, which Dagster calls assets.


Introduction to asset definitions#

Before building a Dagster project, you'll first learn about Dagster's core concept: the asset definition.

An asset is an object in persistent storage that captures some understanding of the world. Assets can be any type of object, such as:

  • A database table or view
  • A file, such as in your local machine or blob storage like Amazon S3
  • A machine learning model

If you have an existing data pipeline, you likely already have assets.

Asset definitions are a Dagster concept that allows you to write data pipelines in terms of the assets that they produce. How they work: you write code that describes an asset that you want to exist, along with any other assets that the asset is derived from, and a function that can be run to compute the contents of the asset.

Here's an example. The asset is a dataset called topstories, and it depends on another asset called topstory_ids. topstories gets the IDs computed in topstory_ids, then fetches data for each of those IDs.

@asset(deps=[topstory_ids])
def topstories() -> None:
    with open("data/topstory_ids.json", "r") as f:
        topstory_ids = json.load(f)

    results = []
    for item_id in topstory_ids:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
        ).json()
        results.append(item)

        if len(results) % 20 == 0:
            print(f"Got {len(results)} items so far.")

    df = pd.DataFrame(results)
    df.to_csv("data/topstories.csv")

Building a DAG of assets#

A set of assets forms a DAG (directed acyclic graph), where the edges correspond to data dependencies between assets. This DAG helps you to:

  • Understand how your assets relate to each other
  • Empower you and your teammates to act, learn, and debug your pipelines

Your DAGs are viewable in Dagster's web-based UI, shown below:

Viewing the asset graph in the Dagster UI
Assets enable the best long-term approach for building pipelines in Dagster and this tutorial focuses on them. However, developers can use Dagster's underlying Ops and Graphs when granular workflows are needed. For more on Ops and Graphs, refer to their tutorial and a guide on how asset definitions relate to ops and graphs.

Next steps#

In this section, you learned how asset definitions are the building blocks of Dagster and also about the benefits of asset-based pipelines.

Within your current data pipelines, you likely have many assets. Or if you are earlier in your data journey, you might have an understanding of what data your stakeholders need.

When going through this tutorial, think about what benefits you'll be receiving by defining your data assets in Dagster.

In the next section, you'll install Dagster on your computer and start a new project.