Referential data integrity with filancore streams
Here is the story of two devs who venture into the realm of office door locks that need an auditable, privacy-friendly access log, and how life would have been much easier had they already known filancore streams…
Contents
﹥Setting the stage: chaining accesses
﹥Interlude: Bad timing
﹥The troubling closed door
﹥The thing about electricity
﹥What if Charlie leaves the company
﹥Damage control
﹥Dave and the GDPR
﹥Combining all that into filancore streams
Alice and Bob are devs, the kind of nerdy devs who love tech challenges. But once in a while they are confronted with projects that seem so easy and yet so uncomfortable to implement. Their latest project is a prime example of that: a customer wants new RFID door locks for their office doors that produce an auditable access log.
It’s Friday, so they just want to jot down some requirements. Detecting RFID chips is something they have done hundreds of times; for that part they will just copy points from an old requirements sheet. That leaves the annoying regulation part: the access logs, for which the customer expressed some vague requirements:
- Each door opening must be recorded and saved on the device.
- If the door is not closed within 1 minute after opening, there is to be another log entry.
- All log entries must be sent to another entity which stores them.
- An auditor must be able to reconstruct all door accesses from the access log the entity stores.
- There must be a way to detect if the entity tampers with the log entries, i.e., if they deviate from what the device stores.
Setting the stage: chaining accesses
At first glance, the requirement about reconstructing all door accesses seemed too easy to implement: simply send each message to some server of their contractor Eve and be done with it. However, that wouldn’t be quite tamper-proof, Alice remarked. Bob sighed and agreed: Eve could do two things they didn’t want her to be able to do, change message contents and drop messages at her discretion.
Fortunately, Bob had a simple solution for the latter: we can regularly ask Eve which messages we sent her and check that against the device storage. Alice disagreed: the device should run for more than a day, after all, and at some point we don’t want all that door-lock chatter on the network.
So they turned to the first problem: avoiding content changes. Being devs, they saw a common solution present itself here: cryptographic signatures over all log entry fields. Great, problem solved. And while we are at it: if Eve cannot change the content of a message, it gets even better, because we can solve the second problem by including something unique to the previous message in the next one. That way, Eve can still drop messages, but not silently anymore.
Now there’s only the question of what unique property to include there. Fortunately, we have to log the access time anyway, so the timestamp is unique (a door cannot unlock twice in the same instant, right?).
Nice and simple, they thought, now we basically have a chain of log entries describing the accesses. Should be good to go.
Summary of current state:
Interlude: bad timing
Alice’s train of thought was interrupted when she received two phone notifications in the same moment. Hmph, she thought they had such a nice design. But if two notifications can happen at the same time, why can’t two doors unlock at the same time too? “Bob”, she said, “we’ll have to change our unique reference point”.
Two doors opening at the same time would allow Eve to simply swap the entries, or even drop one of the entries completely and replace the other with a clone. So we’ll have to include a bit more metadata that makes an entry unique: the door, the time … and the user (because users are unpredictable, and who knows if someone opens two doors at once).
Luckily enough, the user’s access is already recorded, and so is the time, so each door lock just needs some kind of unique identifier, like its built-in serial number, and we have enough data for uniqueness. Building a unique reference out of all that is as easy as hashing the entry-identifying fields. Bad timing, these phone notifications, but another problem avoided upfront.
Summary of current state:
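The new unique reference is a hash over the entry-identifying fields. A minimal sketch (the serial numbers and field separator are made up; any unambiguous encoding of the fields would do):

```python
import hashlib

def entry_reference(door_serial: str, user: str, timestamp: str) -> str:
    """Unique reference for a log entry: a hash over the identifying fields.

    The door serial disambiguates two doors opening in the same instant;
    the user disambiguates someone somehow triggering two doors at once.
    """
    material = f"{door_serial}|{user}|{timestamp}".encode()
    return hashlib.sha256(material).hexdigest()

# Two doors opened in the same instant now yield distinct references:
r1 = entry_reference("SN-0001", "alice", "2024-05-03T08:00:00Z")
r2 = entry_reference("SN-0002", "alice", "2024-05-03T08:00:00Z")
```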
The troubling closed door
After that exhausting first draft, Alice and Bob decided to take a stroll to the coffeemaker. They passed an alarm-secured emergency exit. Bob stopped and asked, “Wait a minute, a door like that is not opened for really long periods. How would our door lock leave an auditable trail if no one has activated it during its whole deployment? Would that be different from Eve dropping all messages?”
Alice thought about it. That looked to be quite a challenge. They could of course just regularly send a “has not been accessed in the last day” log entry to Eve, but she doubted an auditor would like these entries. There must be a better way. Maybe they could anchor the first and last access, or some commission/decommission entry, by sending it to someone other than Eve?
That sounded promising, but where to put it? Let’s just ask Charlie from ops for some kind of immutable database running somewhere in the company to send it to. It’s just two events per deployed door lock, right? Should be manageable.
That just leaves the question of what this additional anchoring implies. Fortunately, not much. The chain starts at the first anchor on Charlie’s machine, which is referenced by the first access log entry on Eve’s machine; then come the access log entries on Eve’s machine, each referencing its predecessor; and finally the “completeness” anchor on Charlie’s machine (which references whichever entry was sent last: the last message on Eve’s machine, or the first anchor sent to Charlie if there were no messages).
And for the door that inspired this question, the one that basically never opens, we simply have no details whatsoever on Eve’s machine, just the two entries on Charlie’s machine, and Charlie is a good guy we trust to have backups. Phew, catastrophe avoided, time to grab that coffee.
Summary of current state:
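An auditor’s walk through that chain can be sketched as follows. Entries are modelled as `payload|<reference>` strings for brevity; the payloads and serial number are invented for illustration.

```python
import hashlib

def h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def audit(first_anchor: str, entries: list[str], last_anchor_ref: str) -> bool:
    """Walk from Charlie's first anchor through Eve's entries to
    Charlie's completeness anchor."""
    expected = h(first_anchor)
    for entry in entries:
        payload, ref = entry.rsplit("|", 1)
        if ref != expected:          # broken predecessor link
            return False
        expected = h(entry)
    # The completeness anchor must reference whichever item came last.
    return last_anchor_ref == expected

first = "commission SN-0001"
e1 = "open alice|" + h(first)    # first entry references the first anchor
e2 = "open bob|" + h(e1)         # each later entry references its predecessor
```

Note the second case below: for the door that basically never opens, the two anchors on Charlie’s machine alone form a complete, auditable trail.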
The thing about electricity
When Alice and Bob reached the coffeemaker, they noticed someone had pulled the plug. Probably one of those tea addicts who needed the socket for a kettle. While Alice plugged the coffeemaker back in to prepare her coffee, she thought aloud, “Do our doors have a backup power supply?” Bob sighed and answered what she already knew: no, of course they did not. And losing power at a bad moment meant losing track of the previous entry. At least all the other entries were persisted, so they would only have to worry about losing one access log entry. It seemed as if they would not run out of issues to think about over their coffee.
Hence, the two of them filled their cups and went back to the drawing board. How do we ensure that after an outage we can resume in a way Eve still cannot tamper with the data? Granted, losing one verifiable access is bad enough, but losing the whole chain over this would be awful. An awfully hacky idea popped up: how about always referencing the first anchor on Charlie’s machine in each log entry, and then putting a summary of all the log entries written so far into the last anchor on Charlie’s machine?
We avoided all the roadblocks we previously stumbled upon, right? Entries are still signed and reference an anchor Eve cannot influence. They are all referenced by another anchor, so Eve cannot drop any entries without us noticing. When encountering a power outage we don’t need to know the definite last message but can just continue appending to the first anchor. Check.
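A minimal sketch of that outage-tolerant variant (same toy `payload|reference` format as before; the anchor payloads are invented):

```python
import hashlib

def h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

# Hypothetical commission anchor on Charlie's machine.
FIRST_ANCHOR_REF = h("commission SN-0001")

def make_entry(payload: str) -> str:
    # Each entry references the *first* anchor instead of its predecessor:
    # after a power outage the device need not remember the last entry it
    # wrote, it just keeps appending.
    return f"{payload}|{FIRST_ANCHOR_REF}"

entries = [make_entry("open alice 08:00"), make_entry("open bob 09:30")]

# The last anchor on Charlie's machine summarizes *all* entries written,
# so Eve cannot drop any of them without the summary giving her away.
last_anchor = {h(e) for e in entries}

def audit(entries: list[str], last_anchor: set[str]) -> bool:
    return (all(e.endswith("|" + FIRST_ANCHOR_REF) for e in entries)
            and {h(e) for e in entries} == last_anchor)
```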
What if Charlie leaves the company
Looking at the previous graph, there is one problem, though: the first and the last anchor are necessary for any external verification of the entry chain. If Charlie’s machine breaks, that’s bad. Both Alice and Bob trust Charlie, but he is getting close to retirement, and his successor Trudy has been known to lose important files. Slowly running out of coffee, Alice and Bob decide to take a step back and reconsider the anchoring.
Luckily enough, they find that Charlie’s machine neither stores many entries nor holds any sensitive data. So why not choose a verifiable distributed storage instead? Something like IPFS, or even a distributed ledger? Especially for auditability, using a ledger sounds quite valuable.
With any distributed storage location, we need to consider that we no longer fully trust the storage, so better safe than sorry: add a signature to our anchors as well. After all, we don’t want the most important metadata of the log entry chain to provide fewer verification paths than the log entries themselves. And we even know what these anchor references look like now: they are just addresses on the ledger.
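Signing an anchor works just like signing a log entry. As before, the HMAC key is only a stdlib stand-in for a real asymmetric device key, and the payload is invented:

```python
import hashlib
import hmac

DEVICE_KEY = b"door-lock-secret"   # hypothetical; a real device would use an asymmetric key

def signed_anchor(payload: str) -> dict:
    """An anchor destined for the ledger, signed like the log entries,
    since the distributed storage is not fully trusted either."""
    sig = hmac.new(DEVICE_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": sig}

def verify_anchor(anchor: dict) -> bool:
    expected = hmac.new(DEVICE_KEY, anchor["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, anchor["signature"])

anchor = signed_anchor("commission SN-0001")
# After publishing, the anchor reference is simply its address on the ledger.
```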
Summary of current state:
There are a few side questions at this point, though: neither Alice nor Bob has really considered the storage costs on a ledger, especially for that sizable last anchor. And they have not settled on a specific ledger either. However, their basic idea stands, so let’s flesh this out when problems occur.
Damage control
This had already become one of the longest Fridays Alice and Bob had worked in years, but they were closing in on a concept, so they weren’t going to just drop it till Monday. Back to the coffeemaker to fuel the last sprint to a solution. While plugged in this time, the coffeemaker insisted on showing a “needs maintenance” screen. Alice turned to Bob and asked, “Bad timing again, but maybe another inspiration. Our door locks can also run into some maintenance mode or irregular state and prematurely stop sending messages, can’t they?”
In contrast to one of the first ideas, where we could at least identify the log entry chain up to a certain point even if Eve had tampered with it, in the current state we rely solely on the last anchor to tell us which entries are actually guaranteed to be part of the log entry chain. A single point of failure, and the whole audit trail depends on it. We should really reconsider that. Maybe it would even be possible to return to our log entry chain concept and utilize the ledger-based storage a bit more?
Wait a moment, Bob remarked, this seems awfully similar to what we have already solved with the last anchor, doesn’t it? We want to guarantee the state at a certain point. So why don’t we regularly create intermediate anchors for our entries that wrap up all the messages we know of up to that point? That way, if the device fails at any given point, we can
- verify the whole chain up to that last intermediate anchor just as we can verify the chain up to the last anchor with all the guarantees we need, and
- verify the chain of all log entries after that last intermediate anchor, without a guarantee that Eve has not dropped messages from the end.
While we are at it, we can even apply another previous idea and chain the anchors on the ledger. That way, we don’t have to include all the messages we know of at that point but only those that were added since the last anchor.
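The chained intermediate anchors can be sketched like this (entry payloads are invented; a real anchor would also carry the signature discussed earlier):

```python
import hashlib
import json

def h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def anchor_ref(anchor: dict) -> str:
    """Reference to an anchor; on a ledger this would be its address."""
    return h(json.dumps(anchor, sort_keys=True))

def intermediate_anchor(prev: dict, new_entries: list[str]) -> dict:
    """Wrap up only the entries added since the previous anchor, and chain
    to that anchor -- the anchors themselves form a verifiable chain on
    the ledger."""
    return {"prev": anchor_ref(prev), "entries": [h(e) for e in new_entries]}

first = {"prev": "", "entries": []}                       # commission anchor
a1 = intermediate_anchor(first, ["open alice", "open bob"])
a2 = intermediate_anchor(a1, ["open carol"])
# If the device dies after publishing a2, everything up to "open carol" is
# still fully verifiable; only entries written after a2 lose the
# drop-protection guarantee.
```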
Alice was stunned. That seemed to be pretty elegant, despite the feeling that they were building a blockchain verifiable by a blockchain on a blockchain. But who cares, this design slowly felt like something that could actually satisfy the customer’s needs.
Summary of current state:
Dave and the GDPR
Now that they have a design, Alice and Bob decide to rubber-duck it with their colleague Dave. As they had never been in the office at a time Dave was not, Dave was basically considered part of the office inventory. So they took their whiteboard and knocked on his door.
Dave liked what he saw. Unfortunately, he is also the poor soul who has to process the customer requests regarding privacy and security. On the privacy front, he found the implementation lacking in two important GDPR principles:
- No leaking of personally identifiable information: The ledger’s storage should be considered public. Especially the intermediate anchors publicly associating the number of entries with the door can be used to track the whereabouts of a single person. That’s particularly troubling for offices with restricted access.
- Expiration of retention periods: The current concept only allows verification of the chain if one has access to each and every log entry and anchor to check the hashes, signatures, and chaining. Deleting part of the data creates a hole in the verification because the previous entry references cannot be checked anymore.
Therefore, he proposes two small alterations:
- Intermediate anchors store only a cryptographic hash over all log entries since the last intermediate anchor, the range being identifiable by timestamps.
- A first anchor may reference a previous last anchor by the hash of its data. That way, one can verify the data up to a certain last anchor, then store its hash to stand in for the whole chain up to that point, so all the underlying data can be deleted.
Alice and Bob agreed that these would actually work in their favor. They’d just have to adjust the operating model of their door lock to routinely start a new chain of log entries that references the previous last anchor. But given that the door lock would have to store all data not deleted elsewhere anyway, this seemed like a good solution.
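Dave’s two alterations can be sketched as follows (timestamps, payloads, and field names are invented; a real anchor would also be signed):

```python
import hashlib
import json

def h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def intermediate_anchor(entries: list[str], t_from: str, t_to: str) -> dict:
    # One digest over the whole batch: the public ledger learns neither the
    # entry contents nor how many accesses happened in the range.
    return {"from": t_from, "to": t_to, "digest": h("".join(entries))}

def start_new_chain(previous_last_anchor: dict) -> dict:
    # A fresh first anchor referencing the retired chain only by hash: once
    # a chain has been audited, its hash stands in for it, and all the
    # underlying entries and anchors may be deleted.
    return {"prev_chain": h(json.dumps(previous_last_anchor, sort_keys=True))}

old_last = intermediate_anchor(["open alice", "open bob"],
                               "2024-01-01", "2024-03-31")
fresh_first = start_new_chain(old_last)
```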
Summary of the final state:
Combining all that into filancore streams
Alice and Bob were proud of how their Friday design looked at this point. Time to save the work and head home. Had they known about filancore streams, they could have saved themselves from working longer than necessary on a Friday evening and simply built their pipeline on top of a reliable solution for referential data integrity.
In short, filancore streams provide a design very similar to the one presented here:
- A verifiable chain of cryptographically signed messages stored in a database.
- filancore records to anchor these chains on the distributed ledger IOTA using hashes of the data, even more private than the anchors discussed above.
- Guarantees about
- the order the messages were added to the chain,
- the completeness of the chain (similar to the purpose of the last anchor above),
- the integrity of the messages (i.e., no undetected modification of their contents, similar to the content-based hashes above),
- the creator of the messages by using self-sovereign identities to manage signing key material, and
- the time a message was received from the device.
If you are interested in a tailored solution providing these guarantees for data integrity in your use case and environment, we would be happy to hear from you.