Spark Dataframe Metadata

SIRIGIRI HARI KRISHNA
Published in Towards Dev
2 min read · Dec 26, 2022


A Spark DataFrame is structurally the same as a database table. However, Spark does not store the DataFrame's schema in a persistent metadata store. Instead, it keeps schema information in a runtime metadata catalog. The catalog plays the same role as a metadata store, but Spark creates it at runtime, and it lives only for the duration of the application.

There are two reasons why Spark keeps schema information in a runtime catalog rather than a persistent metadata store.

1. A Spark DataFrame is a runtime object – you create a DataFrame at runtime, and it stays in memory only until your program terminates. Once the program ends, the DataFrame is gone; it is purely an in-memory object.

2. A Spark DataFrame supports schema-on-read – a DataFrame does not have a fixed, predefined schema stored in a metadata store. Instead, you supply the schema when loading the data: Spark reads the file, applies the schema at read time, creates the DataFrame with that schema, and loads the data into it.


Data Engineer passionate about Spark, Azure, and the Cloud. Simplifying data complexities on my Medium blog. Let's dive into the world of data together!