Note: This blog post and the accompanying photograph are contributions from our developer Minh Ha.
The unique challenges that KYC service providers face when integrating with a blockchain became apparent when we integrated our first KYC service provider with the Polymesh Incentivized Testnet (ITN).
Onboarding a user to the Polymesh network involves a series of steps that KYC service providers need to perform after verifying the user’s identity document:
- Register the user’s hashed identity data with PUIS in exchange for a uID
- Assign a new on-chain distributed identity (DID) to the user’s provided wallet address
- Generate and assign an association between the above uID and DID, in the form of an on-chain CDD claim
- Provide the user with their newly generated uID to store in their Polymesh Wallet (this allows the user to prove their real-world identity when signing transactions on-chain).
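As a rough illustration of how the four steps depend on each other, the sketch below runs them in order against in-memory stand-ins. Everything in it is an assumption made for the example: `FakeBackends`, its method names, the `uid-`/`did-` formats, and the use of SHA-256 as a placeholder hash. It does not reflect the actual PUIS or Polymesh APIs.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FakeBackends:
    """In-memory stand-ins for PUIS and the chain (hypothetical, illustration only)."""
    uids: dict = field(default_factory=dict)   # identity hash -> uID
    dids: dict = field(default_factory=dict)   # wallet address -> DID
    claims: set = field(default_factory=set)   # (uID, DID) associations

    def register_identity(self, identity_hash: str) -> str:
        # Registering the same hash twice returns the same uID.
        return self.uids.setdefault(identity_hash, f"uid-{len(self.uids) + 1}")

    def assign_did(self, wallet: str) -> str:
        return self.dids.setdefault(wallet, f"did-{len(self.dids) + 1}")

    def add_cdd_claim(self, uid: str, did: str) -> None:
        self.claims.add((uid, did))


def onboard(backends: FakeBackends, identity_data: bytes, wallet: str) -> str:
    # Only a hash of the identity data is passed on (SHA-256 is a placeholder).
    identity_hash = hashlib.sha256(identity_data).hexdigest()
    uid = backends.register_identity(identity_hash)  # step 1: hash -> uID
    did = backends.assign_did(wallet)                # step 2: wallet -> DID
    backends.add_cdd_claim(uid, did)                 # step 3: uID <-> DID claim
    return uid                                       # step 4: uID goes to the user
```

Each step consumes the output of an earlier one, which is why the order matters and why a failure partway through leaves work that has to be resumed rather than restarted.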
During integration with our first service provider, these steps were left for the service provider to implement. This proved not only challenging for the service provider but also difficult to scale across the network, for several reasons. Service providers had to be familiar with both Polymesh and PUIS, and they had to implement a complex sequence of operations that was free of race conditions and recoverable after failures. With multiple service providers each implementing the steps on their own, there was also a high risk of inconsistent implementations.
To take on the above responsibility, the Onboarding Integration Service was introduced.
Introducing the Onboarding Integration Service
The Onboarding Integration Service exposes a single REST endpoint for service providers to call. This means they no longer need to implement the above-mentioned steps themselves and can instead focus on their core competency of verifying users’ real-world identities.
To accomplish this, the Onboarding Integration Service introduces the concept of a “CDD application”. Users start the onboarding process by logging in to the Polymesh onboarding page (onboarding.polymesh.network) with their email address. There, a user submits a new CDD application with a wallet address and a service provider of their choice. A deep link to the service provider, containing the CDD application ID, is then emailed to the user, who follows it to start their KYC process. From here, service providers are responsible for verifying users’ real-world identity documents and associating the result with the CDD application ID contained in the deep link.
Once verified, service providers simply submit the users’ hashed identity data along with the CDD application ID to the Onboarding Integration Service. If successful, our service will enqueue a new processing job internally and respond immediately with a success message. At this point, the service provider’s role is complete and processing will begin in the background. Once processing is finished, the Onboarding Integration Service sends the user an email with a link to retrieve and store their uID. Because the only identity data received by the Onboarding Integration Service is a set of hashed data, our system never actually has access to the users’ real-world identity.
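The accept-then-process-later behaviour described above can be sketched as follows. This is a minimal illustration, not the service's real endpoint: the function name, the in-memory `applications` store, the `job_queue`, and the status codes are all assumptions made for the example.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the application store and the task queue.
applications = {"app-42": {"status": "pending", "wallet": "wallet-A"}}
job_queue = deque()


def submit_verification(application_id: str, identity_hash: str) -> tuple[int, dict]:
    """Validate the submission, enqueue a job, and respond without waiting for it."""
    app = applications.get(application_id)
    if app is None:
        return 404, {"error": "unknown CDD application"}
    if app["status"] != "pending":
        return 409, {"error": "application already submitted"}
    app["status"] = "processing"
    # Only the hash is stored; the identity document itself never reaches this code.
    job_queue.append({"application_id": application_id, "identity_hash": identity_hash})
    return 202, {"status": "accepted"}
```

The handler does no on-chain work itself. It only records that work is pending, which is what lets the response be immediate.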
The technical execution
From early on, it was determined that a background task queue was an important requirement for the system. This was due to the need for:
- Signing on-chain transactions sequentially
On-chain operations need to be performed with a signing key as transactions, but because there can only be one transaction per key at any given moment, all on-chain operations need to be grouped by signing key and performed sequentially. The first service provider, who was implementing their own process, found this especially challenging. Since Polymesh needs to support multiple signing keys (at least one per service provider), the task queue needs to support partitioning to avoid operations done by different signing keys (either belonging to different service providers, or the same service provider using different keys) blocking each other. However, it also needs to maintain sequential operations for tasks performed using the same key.
- Recoverability in failure events
During the lifetime of the service, any of the four steps mentioned above is likely to fail at some point. The cause may be a network failure, a signing key running out of funds, or an unforeseen circumstance that requires manual investigation. It is therefore important to capture the last state of a CDD application accurately when a failure occurs, and to pick up from the point of failure once the issue has been resolved.
Given the above requirements and a tight deadline, we decided to adopt a tech stack powered by AWS Lambda, SNS, SQS, and PostgreSQL on RDS, deployed through the Serverless framework.
One Amazon SNS Topic is used as the backbone of the message queue, configured to fan out to multiple SQS queues, each responsible for one of the 4 steps above. Each SQS queue is observed by one Lambda which picks up the events to perform the associated step. A central queue determines the next appropriate step and writes the next event to SNS and the cycle continues. This allows us to satisfy the two requirements in the following ways:
- Signing on-chain transactions sequentially
When a service provider submits the identity hash, the associated CDD application ID begins processing and is assigned the service provider’s signing key (this would have been previously registered as part of onboarding a new service provider). The application is then propagated through the pipeline as SNS and SQS messages, where MessageGroupId values are set to be the associated signing key (which is encrypted when serialized to ensure data security). MessageGroupId satisfies the above-mentioned need for sequential operations with partitioning per signing key. To prevent duplication, MessageDeduplicationId was also used to ensure that each CDD application only has one step performed at any given time.
- Recoverability in failure events
All SQS queues are configured with automatic retries. In the event of failure, the system automatically retries twice, for a total of up to three attempts. During short-lived outages, such as those caused by network issues, each step is able to recover on its own without intervention from operators. However, if all three attempts fail, the event is moved to a dead-letter SQS queue, from which it is persisted as raw JSON to Postgres. This allows us to investigate the cause of failure and resolve any issue, after which the events are simply fed back into SNS to be picked up where they left off.
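The sketch below shows how the message parameters and the retry limit described above might be assembled. It only builds the values and makes no AWS calls. `MessageGroupId`, `MessageDeduplicationId`, and the SQS `RedrivePolicy` attribute are real SNS/SQS fields, but the helper names, the message body, the deduplication scheme, the `step` message attribute (one possible way to route fan-out by step), and the ARNs are assumptions for illustration.

```python
import json


def build_publish_params(topic_arn, application_id, step, signing_key_ref):
    """Keyword arguments in the shape accepted by SNS `publish` on a FIFO topic."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps({"applicationId": application_id, "step": step}),
        # Messages sharing a group ID are delivered in order; other groups are not blocked.
        "MessageGroupId": signing_key_ref,
        # Assumed scheme: one ID per application and step, so a repeated publish is dropped.
        "MessageDeduplicationId": f"{application_id}:{step}",
        # Assumed routing attribute that a subscription filter could match on.
        "MessageAttributes": {"step": {"DataType": "String", "StringValue": step}},
    }


def build_redrive_policy(dead_letter_queue_arn, max_attempts=3):
    """Value for the SQS `RedrivePolicy` attribute: after `max_attempts` receives,
    the message is moved to the dead-letter queue."""
    return json.dumps(
        {"deadLetterTargetArn": dead_letter_queue_arn, "maxReceiveCount": max_attempts}
    )


params = build_publish_params(
    "arn:aws:sns:us-east-1:123456789012:onboarding.fifo",  # placeholder ARN
    "app-42",
    "assign_did",
    "key-ref-1",  # opaque reference to the (encrypted) signing key
)
policy = build_redrive_policy("arn:aws:sqs:us-east-1:123456789012:onboarding-dlq.fifo")
```

With a client library such as boto3, `params` would be passed as keyword arguments to the SNS client's `publish` call, and `policy` would be set as the `RedrivePolicy` attribute on each step queue.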
What’s on the horizon
As Polymesh continues to grow, we will likely see more KYC service providers joining the ecosystem. Although much of the system is provider-agnostic, some custom development will likely still be needed. For example, if a KYC provider is unable to reach our system due to network issues, the logic to recover from such outages needs to be implemented per provider. Likewise, when onboarding a new provider, setting up their signing key is still a manual process. These are examples of areas we will continue to refine to further reduce friction for KYC service providers joining the Polymesh ecosystem.
Keep an eye on our blog for more stories about how we built Polymesh. Curious about something specific? Join our developer community to let us know what you want to hear next!