1.1 Unity Catalog- Architecture
You have a centralized Unity catalog. And it manages two things: User base and Metastore.
Unity Catalog:
The Unity Catalog acts as a central hub for managing and governing various aspects of your data and user access within the Databricks environment.
It plays a dual role by managing both the user base and the Metastore.
User Base Management:
Within the Unity Catalog, all Databricks user identities and service principals are stored and managed. This means that user authentication, authorization, and user-related information are centralized within the Unity Catalog.
User identities include individuals who interact with the Databricks platform, and service principals represent the identities used by services or applications to access resources.
Example:
If a new data scientist joins the organization and needs access to Databricks resources, their user identity is added or managed within the Unity Catalog. Similarly, if a service or application needs to interact with Databricks, its service principal information is configured within the Unity Catalog.
Metastore Management:
The Unity Catalog Metastore is responsible for storing and managing metadata related to data objects, such as databases, tables, and views.
This means that the definition, structure, and properties of these data objects are centralized within the Metastore, providing a unified view of metadata across different workspaces or projects.
Example:
When a data engineer creates a new database or table within a Databricks workspace, the metadata associated with that data object is stored in the Unity Catalog Metastore. This metadata includes information like schema, location, and access permissions.
Integration with Databricks Services:
Databricks services and components interact with the Unity Catalog to leverage the centralized user base and Metastore.
Authentication and authorization services utilize the user information stored in the Unity Catalog to control access to Databricks resources.
When users or services interact with data objects (databases, tables, etc.), Databricks components refer to the Unity Catalog Metastore to retrieve and validate metadata, ensuring consistency and coherence across the environment.
Example:
When a user attempts to access a specific table within a Databricks workspace, the authentication service checks the user’s identity in the Unity Catalog to determine whether they have the necessary permissions. Simultaneously, the metadata service queries the Unity Catalog Metastore to retrieve information about the table, such as its structure and access controls.
In summary, the Unity Catalog serves as a centralized governance solution, unifying user management and metadata management for Databricks, thereby providing a cohesive and consistent data environment across the organization. This architecture contributes to enhanced security, collaboration, and data governance.
Summary
When you connect your workspace to the Unity Catalog, it’s like linking your playground to a master control center. This control center, the Unity Catalog, becomes aware of everything happening in your workspace — your users, the tables you’re working with, and who has permission to do what.
Imagine you have different spaces where you’re doing your work, like different rooms in a house. The Unity Catalog acts as a central brain that oversees all these rooms. Once connected, every time you run a query in one of these rooms (workspaces), the Unity Catalog steps in to make sure you’re allowed to do what you’re trying to do.
For instance, if you want to select or update a table, the Unity Catalog checks its list to see if you have the green light. It knows who you are, what tables are available, and what actions you’re permitted to perform. It’s like having a very diligent assistant who makes sure you’re following the rules.
Connecting to the Unity Catalog brings a bit of magic to your work. The Spark SQL engine, which is like the engine running your data analysis processes, starts relying on the Unity Catalog for permission checks and understanding your queries. It simplifies things for you — just focus on your analysis and let the Unity Catalog handle the behind-the-scenes details, almost like having a genie that understands and fulfills your data wishes.
further article