I recently undertook the migration of an on-premise Cloudera Kafka cluster to a Confluent Cloud dedicated cluster. I would like to share some of our experience migrating accounts and permissions to Confluent Cloud. Even though both are Kafka clusters, their implementations of access management are quite different, and an as-is migration was impossible.
Background
Cloudera
Cloudera and Hortonworks Kafka distributions are typically configured with Kerberos as the authentication method, so a KDC needs to be supplied. Generally, the clusters are kerberized against a corporate directory such as Active Directory or Red Hat IdM (FreeIPA), and thus rely on an external directory system to manage authentication. These corporate LDAP directories contain user account information and application accounts. Leveraging the corporate directory makes sense, as it is a great asset. Integrating with a corporate directory has a few important advantages:
- Authentication is fully managed and secured by Kerberos;
- Corporate policies such as password complexity and expiration have been implemented;
- The account provisioning process is usually supported by an established process and tool(s);
- Account lifecycle is usually well managed and taken care of.
In the Cloudera Kafka distribution, authorization is handled by Apache Ranger. Typically, you sync LDAP accounts and groups into Ranger via the Ranger usersync service, and then assign authorizations on services. Each Big Data service Ranger plug-in (e.g. HDFS, NiFi, Hive, Kafka) brings its own specific Access Control List (ACL) and implementation specifics. For example, in the case of Kafka topic authorization (also called permissions in Ranger), we can configure them to allow or deny up to eight permissions:
- Publish;
- Consume;
- Configure;
- Describe;
- Create;
- Delete;
- Describe Configs;
- Alter Configs.
It is best practice to implement a security model based on groups rather than individual accounts; it makes for a more flexible solution that is also easier to maintain.
I would like to point out that until Ranger v2.1.0, Consumer Group ACLs were not managed specifically in Ranger. Authorization for a topic was managed at the topic level only, for instance by assigning the 'Consume' permission.
Note: Consumer Group ACLs need to be specifically managed in Confluent Cloud, or consumers will not be able to read from topics.
Confluent Cloud
In Confluent Cloud, accounts are replaced by API keys. The API key/secret pair is generated either with the ccloud CLI or the Cloud Console. The API key and secret are generated locally in the control plane and cannot be modified after creation. There is no way to integrate back with an authentication system; authentication and authorization are managed locally in Confluent Cloud.
Integration back to a corporate directory is not supported for topic security. This is quite different from the SSO/SAML integration supported for the Confluent Cloud UI or the ccloud CLI; the UI and CLI are used for administrative tasks in your cloud instance only.
If an API key is not linked to a Confluent service account with --service-account, it is granted full access to all resources in the cluster. This is a nice feature when testing the platform; however, when a multi-tenant cluster is implemented with users and various applications, security needs to be more granular.
If you need to restrict the scope of an API key, it has to be attached to a service account on which ACLs can be applied. Confluent Cloud supports a subset of the Kafka ACLs normally available in an on-premise implementation.
The supported Confluent Cloud Topic Kafka ACLs are:
- Alter;
- AlterConfigs;
- Create;
- Delete;
- Describe;
- DescribeConfigs;
- Read;
- Write.
As you can see, there is no 'Consume' ACL as there is in Ranger. As stated above, granular permissions have to be managed at both the topic and the consumer group level in order to authorize a consumer of a topic.
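To make this concrete, a consumer needs both a topic READ ACL and a consumer-group READ ACL. Below is a small hypothetical helper that emits both CLI commands; the flag names mirror the ccloud kafka acl create sample shown later in this article, and --consumer-group is assumed here:

```python
def acl_commands(service_account_id, topic_prefix, group_prefix=None):
    """Build the ccloud CLI commands granting read access to a topic
    prefix plus the consumer-group ACL consumers also need.
    Hypothetical helper; flag names assumed from the ccloud v1 CLI."""
    group_prefix = group_prefix or topic_prefix
    base = ("ccloud kafka acl create --allow --service-account {} "
            "--operation READ".format(service_account_id))
    return [
        "{} --topic {} --prefix".format(base, topic_prefix),
        "{} --consumer-group {} --prefix".format(base, group_prefix),
    ]
```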
Solution
We understand that Cloudera and Confluent Cloud provide very different authentication and authorization methods. However, for customers that have invested in a multi-tenant security model based on LDAP groups, implementing a similar model with Confluent Cloud would make sense if it were available. The solution would need to apply security automatically, just like Ranger, and it would also need to rely on LDAP groups to apply ACLs, just like Ranger. If these requirements were met, we could hope to minimize migration effort and risk, since the "security rules" would be automatically applied to the new Confluent Cloud cluster as applications and users are migrated to the cloud.
We decided to use Active Directory as the reference system for the tool that automates account creation and authorization in Confluent. We take full advantage of the groups, accounts, and membership relationships that currently drive authorization in Cloudera Kafka and transpose them to the Confluent Cloud cluster. The goal is to fully automate the implementation of the current group-based security model in Confluent Cloud and thus provide a smooth transition to the new platform. No manual adjustments should be necessary; it needs to be AD-group driven only.
The glue, confluent_AD_sync.py, was developed in Python. At every run, it fetches AD relationships, pulls in the Confluent configuration (service accounts, API keys, ACLs, clusters, environments), and determines what needs to be executed on a cluster so that Confluent Cloud stays in sync with the reference system, AD. Similar to Terraform, it can be executed in plan mode or in execution mode. At every run, it analyzes how to go from the current state of the cluster to the desired state.
If users are added to "Kafka" groups in AD, on the next run these users are added to Confluent Cloud and their topic ACLs are adjusted accordingly. If users are removed from a given group, their ACLs are adjusted and they are no longer granted access to the topic(s). If any manual adjustments are performed in Confluent Cloud with the CLI (except for a few exclusion rules we have set), confluent_AD_sync.py will bring the cluster back to the desired state by adding or removing configurations. Of course, any adjustments are first planned and then executed if desired. We have implemented failsafes so that a human can verify the plan and execute it if appropriate.
There are two types of execution plans: the first is for user accounts, the second for application accounts. They are handled very similarly; however, there are a few differences, especially in the handling of the secret. These differences are described in the following sequence diagrams.
Design for User Access to a Cluster
With application accounts, there are a few differences in how the accounts are handled. Most of the time, application accounts do not have a valid or active email account, so we decided to write the credentials directly into a vault. We see two benefits to this mechanism: first, the API key and secret are secured in a safe place for the application owner; second, it enables and promotes the use of CI/CD pipelines in application deployments. The secret can either be used directly by an application or handled by a pipeline. Typically, the pipeline would copy the credentials into its own application-specific vault.
Design for Application Accounts Access to a Cluster
Discussion on Relationships between AD and Confluent
In order to keep things simple, we maintain a one-to-one relationship between an AD account and a Confluent Cloud service account; the service account is named after the AD sAMAccountName field in order to link back to the AD member. The service account name is limited to 32 characters.
We like to see and handle the service account a bit like a group: it regroups ACLs across clusters and carries a key to access these resources. Service accounts are not specific to a cluster but are common across environments (multiple clusters can be created in a given environment).
sAMAccountName: applicationA
Service_Account_Name: applicationA
Even though it is possible to link multiple API keys to a single service account, we maintain a one-to-one relationship between a service account and an API key; this simplifies traceability and auditing. Essentially, a given user can only authenticate with a single account.
Active Directory sAMAccountName: applicationA
ccloud Service Account:
Name: applicationA
id: 123456
ccloud API-Key:
key: <key>
owner: 123456
description: '{ "name" : displayName, ... }'
For traceability of the service accounts and API keys, we built a common JSON description that matches the user's Active Directory information. This information is pushed into the description field of the service account or API key and acts as metadata for further processing. You are limited to 128 characters in the description field of the service account. As you can see below, we have four fields in the description: the displayName, the email, a timestamp, and a creator.
description = json.dumps({
    'name': ADDetails[member]['displayName'],
    'mail': ADDetails[member]['mail'],
    'created': whenCreated,
    'creator': 'confluent-AD-sync'})
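Because the description field is capped at 128 characters, it is worth validating the JSON before pushing it. A minimal sketch, assuming the display name is the safest field to shorten when the limit would be exceeded (the function name and trimming strategy are illustrative, not the exact ones used by confluent_AD_sync.py):

```python
import json

MAX_DESCRIPTION = 128  # Confluent Cloud description field limit

def build_description(display_name, mail, when_created,
                      creator="confluent-AD-sync"):
    """Build the JSON description, trimming the display name if the
    result would exceed the 128-character field limit (sketch)."""
    desc = json.dumps({"name": display_name, "mail": mail,
                       "created": when_created, "creator": creator})
    overflow = len(desc) - MAX_DESCRIPTION
    if overflow > 0:
        # each trimmed character shortens the JSON by one character
        desc = json.dumps({"name": display_name[:-overflow], "mail": mail,
                           "created": when_created, "creator": creator})
    return desc
```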
Relationship diagram between Active Directory Account and ccloud
As the plan is built and executed, these relationships are enforced and maintained, any manual changes or deviations from the desired state are corrected to maintain these relationships.
Discussion on AD Groups
We have to understand that the current implementation of the security model provides access to topics based on Active Directory group membership. A given group membership provides access to a set of related topics.
Let's start the discussion with an example: if a user is a member of a group called DEV.HR.SALARY.READ, the user will be granted READ access to topics with the prefix HR.SALARY.* within the DEV Kafka cluster. If this same user is removed from the DEV.HR.SALARY.READ group, their ACL will be revoked. If the user is removed from all Kafka groups, their service account, ACLs, and API key will be deleted from the cluster.
As you can see, our group naming convention provides all the information necessary to enable automation. Let's elaborate a little on the convention, <environment>.<department>.<topic description>.<acl>: the first field stands for the Kafka environment the access applies to, the second describes the department name, the third describes the content of the topic, and the last is our standardized ACL field, one of READ, READ WRITE, or READ WRITE EXECUTE.
If you need to implement something similar, a descriptive group nomenclature helps provide insight into what access members should be granted.
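To make the convention concrete, here is a rough sketch of how such a group name could be decomposed; the field names are illustrative, not the exact ones used by confluent_AD_sync.py:

```python
def parse_group_name(group_name):
    """Split <environment>.<department>.<topic description>.<acl>
    into its components; the topic prefix is everything between the
    environment and the ACL field. Illustrative sketch only."""
    parts = group_name.split(".")
    if len(parts) < 4:
        raise ValueError("unexpected group name: %s" % group_name)
    return {
        "environment": parts[0],
        "department": parts[1],
        "topic_prefix": ".".join(parts[1:-1]),  # e.g. HR.SALARY
        "acl": parts[-1],
    }
```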
Now that we have a standardized naming convention for our groups, querying Active Directory is quite straightforward; we just need to consider nested groups.
Ranger does not support nested groups; therefore, the security model had to be "flattened" to accommodate this drawback. This is usually a hassle and a constraint for your Active Directory administrator. In order to provide flexibility in the model, we decided to implement support for nesting.
When querying for membership with the ldap module, you will want to handle nested groups; Microsoft provides a special matching rule to add to the filter just for this. It walks the chain of ancestry all the way to the root until it finds a match. This can be a bit slow, but that is a known limitation of this Microsoft LDAP feature. For faster queries (though not integrated with Python), PowerShell is quite fast.
ldap_memberof_filter = 'memberOf:1.2.840.113556.1.4.1941:'
ldap_user_obj_class = 'user'
# the matching rule expects the group's full distinguished name (example DN)
ldap_group_dn = 'CN=DEV.HR.SALARY.READ,OU=Groups,DC=example,DC=com'
ldap_member_filter = "(&(objectClass={})({}={}))".format(ldap_user_obj_class, ldap_memberof_filter, ldap_group_dn)
If you want to implement this in Red Hat IdM or FreeIPA, you will need to handle the nesting logic in Python yourself and walk the chain of groups until you find all the members. Circular references are not permitted in IdM, but I would still advise limiting the depth of your queries.
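In pseudo form, that nesting logic amounts to a breadth-first walk over groups with a visited set (to tolerate cycles) and a depth cap. A sketch, with get_entry standing in for the actual LDAP lookup:

```python
from collections import deque

def resolve_members(group, get_entry, max_depth=10):
    """Collect all user members of a group, following nested groups.
    get_entry(name) must return (users, subgroups) for that group;
    the visited set guards against cycles and max_depth caps the walk."""
    users, seen = set(), set()
    queue = deque([(group, 0)])
    while queue:
        name, depth = queue.popleft()
        if name in seen or depth > max_depth:
            continue
        seen.add(name)
        direct_users, subgroups = get_entry(name)
        users.update(direct_users)
        queue.extend((sub, depth + 1) for sub in subgroups)
    return users
```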
Discussion on AD Accounts
In order to efficiently manage accounts in Confluent, a few LDAP attributes need to be pulled in. We've discussed the sAMAccountName field, which is used as the service account name. The displayName attribute is used in the description field. Of course, the email address is used in the description field, but also to communicate the user credentials.
One more field needs to be considered: UserAccountControl. This field contains the account property flags; for instance, 0x200 (512 in decimal) describes a normal account. If the account is disabled, the 0x002 flag is added (0x200 + 0x002 = 0x202 = 514 in decimal). This field is very important if you want to manage the account lifecycle. A good discussion on the UserAccountControl flags can be found here.
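Checking the disabled flag is then a simple bitwise test, for example:

```python
ACCOUNTDISABLE = 0x0002  # userAccountControl "account disabled" bit
NORMAL_ACCOUNT = 0x0200  # 512 in decimal

def is_disabled(user_account_control):
    """Return True when the ACCOUNTDISABLE bit is set; accepts the
    attribute as an int or as the string LDAP usually returns."""
    return bool(int(user_account_control) & ACCOUNTDISABLE)
```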
Confluent Cloud Control Plane Interface
Now that we understand the Active Directory groups and their relationship with service accounts, API keys, and ACLs, we need to interface with the Confluent control plane in order to create and manage accounts. Unfortunately, as of this writing, there is no API available for the control plane. The only way to interface with the cloud platform is through the CLI or the web interface.
Fortunately, the CLI is pretty straightforward and supports YAML or JSON output. This makes the output easy to parse and transform into a Python object.
import subprocess
import yaml

try:
    stream = subprocess.check_output("ccloud service-account list -o yaml", shell=True)
except subprocess.CalledProcessError as e:
    <handle the error>
...
# check_output returns the command's output as bytes, which yaml.load accepts
service_accounts = yaml.load(stream, Loader=yaml.FullLoader)
In our case, we simply created a set of functions that the plan will call upon in order to bring the cluster back into a desired state. We can create objects:
- Create an api-key;
- Create a service account;
- Add a topic ACL;
- Add a group ACL;
- Add a cluster ACL.
We can also delete objects:
- Delete a service account;
- Delete a topic ACL;
- Delete a group ACL;
- Delete a cluster ACL.
The plan always compiles what needs to be executed first and then executes the code against the control plane in the right order.
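At its core, the plan is a set difference between the desired state (derived from AD) and the current state (pulled from the CLI). A simplified sketch, where each ACL is modelled as a hashable tuple (the exact representation in confluent_AD_sync.py may differ):

```python
def build_plan(desired_acls, current_acls):
    """Compare the desired ACL set (from AD) with the current one
    (from the ccloud CLI) and return what to create and delete.
    ACLs are modelled as hashable tuples, e.g.
    (service_account_id, operation, resource_type, resource_name)."""
    desired, current = set(desired_acls), set(current_acls)
    return {
        "create": sorted(desired - current),
        "delete": sorted(current - desired),
    }
```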
Below is a sample of some of the CLI commands that need to be run in order to create the Confluent objects. This is just a sample of how everything is knit together with the CLI.
## Create the Service Account based on AD (Service accounts are valid across Kafka clusters in a given environment)
ccloud service-account create <sAMAccountName> --description "<json description>" -o yaml
id: <id>
name: <sAMAccountName>
## Grant Access based on group ownership (ACLs are valid across Kafka clusters in a given environment)
ccloud kafka acl create --allow --service-account <id> --operation "READ" --topic <topic prefix> --prefix
## Create the api-key (Valid for one cluster only)
ccloud api-key create --service-account <id> --resource <cluster ID> --description "<json description>" -o yaml
key: <key>
secret: <long secret>
Considered Alternatives and Final Thoughts
For the authorization portion of things, an alternative to confluent-AD-sync.py could have been the development of a Ranger integration with Confluent Cloud.
The native Ranger usersync service would have taken care of syncing the proper Kafka groups from Active Directory; the UI would have allowed for granular and flexible authorization assignments. The only piece missing would have been the interface with Confluent Cloud.
Simply put, this could have been developed by parsing and understanding the Ranger policycache found in /etc/ranger/<cluster_name>/policycache/kafka_<cluster_name>.json or by use of the Ranger API. From this authoritative JSON source, we would be able to reconstruct the group memberships and their relationship to topics, and apply or delete the corresponding ACLs.
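As a sketch of that alternative, the policycache JSON could be reduced to group-to-topic grants roughly as follows. The field names reflect the Ranger policy model (policies, resources, policyItems, accesses), but the exact layout should be verified against your Ranger version:

```python
def extract_group_grants(policycache):
    """Reduce a Ranger Kafka policycache document to
    {group: [(topic, access_type), ...]} pairs. Field names are
    based on the Ranger policy model and may need adjusting."""
    grants = {}
    for policy in policycache.get("policies", []):
        topics = policy.get("resources", {}).get("topic", {}).get("values", [])
        for item in policy.get("policyItems", []):
            accesses = [a["type"] for a in item.get("accesses", [])
                        if a.get("isAllowed")]
            for group in item.get("groups", []):
                for topic in topics:
                    for access in accesses:
                        grants.setdefault(group, []).append((topic, access))
    return grants
```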
Of course, the account provisioning process would still have to be taken care of somehow.
Conclusion
We understand that maintaining the source of truth in Active Directory has advantages. AD is usually at the core of the enterprise, and mature processes are in place for identity and access management. We feel that anytime we can lean on a reference system, it's a plus within the organization.
confluent-AD-sync.py takes advantage of LDAP as the source system and ensures that Confluent Cloud is kept in sync with account provisioning and authorization. Of course, there are drawbacks to such an implementation, but the automation and consistency it provides greatly outweigh the disadvantages of this integration.