We love telling the stories behind the daily challenges we face and how we solve them. We also love hearing about the insights, experiences and the lessons learned shared by prominent voices from the global community.
In this new series of 1-on-1 interviews, our very own Wix Engineers talk directly with some of the most inspiring minds in the tech industry. From software engineering to architecture, open source, and personal growth.
And we’re starting this series with an insightful conversation between Natan Silnitsky and Martin Kleppmann.
In different tech circles - arguably in a lot of those - Martin Kleppmann needs no introduction. Most famous for his best-selling book “Designing Data-Intensive Applications”, Martin is also currently busy working as a Senior Research Associate and Affiliated Lecturer at the University of Cambridge Department of Computer Science and Technology. A great speaker and writer, capable of explaining complex subjects with great clarity, patience and thoroughness.
Below is a collection of excerpts from the conversation between Martin and Natan - you can watch the full talk here.
Q: What would you say are the main challenges of building modern-day distributed data systems?
There are so many different tools available to choose from - and it’s difficult to figure out which tools you would use for what purpose. People say, “choose the right tools for the job”- but how do you know which tool is actually the one for your particular job?
That’s really what I tried to address in my book. People ask me if I could make a flow-chart where you ask five simple questions and it tells you “you should pick this database”. And I tried to - and realised that it just wasn’t possible to do. Because the questions you have to ask require so much background thinking.
I’m trying to teach the readers of my book which questions they should be asking of their own systems in order to figure out what aspects they need, what things they don’t need, and, therefore, which systems might be appropriate for their particular task.
Q: What kind of questions do you tend to get from readers?
One thing that comes up, for example, is that you get SQL databases - for all sorts of different things. And you might say, well, one SQL database is interchangeable with a different SQL database because they use the same query language. But actually no, you get very different databases at very optimised or totally different use cases, and that’s not apparent at all from the query language.
So one subdivision I make in the book (which like any categorisation is not perfect but at least that’s a first approximation) is that you can divide systems into transaction processing systems and analytic systems - and they tend to add very different access patterns.
The transaction processing systems will usually have lots of very small read-write transactions which each change a small number of records in the database. On the other hand, you’ve got the analytic systems, where most of your data writing is bulk loads, bulk imports from some other data source. And most of your querying is very large queries that have to scan through a huge number of different rows.
Depending on which of those two you are, you make totally different design decisions in terms of how you lay out the data on disk, how the query plan gets executed, how you distribute the database across multiple machines...
Everything changes depending on what that access pattern is. And that’s one of the high-level questions, I think, I tried to get across. And then also by showing what are the consequences of that design decision, how do you lay out the data if you wanted for a transaction processing versus how do you lay it out if it’s for analytics. The data layouts look entirely different.
Q: The book was published in 2017, any new scientific breakthroughs in this field since then?
There’s always new stuff happening. As you know, the tech industry is extremely fashion-driven, where it’s always like “Oh, we have the best, amazing, biggest new thing that’s going to change everything!”. And often, if you strip away the layers of marketing and look at the fundamentals, you realise that actually not that much has changed.
Often there’s a nugget of a new idea or two, which is kind of interesting, but it’s not really the paradigm change that people make it out to be. And because my book looks at the fundamentals and tries to strip away all of the marketing stuff, it actually hasn’t gone out of date very much.
The fundamentals are pretty slow changing - the one recent fundamental change, for example, is non-volatile memory, as we’ve gone from spinning disks to SSD and now increasingly to non-volatile memory. That’s a fundamental change but I’m not sure it’s really percolated very much into the architecture of the practical systems we use nowadays. I think it’s interesting to look at those sort of underlying trends - but also to realise that a lot of the database architecture is still essentially based on the spinning disk model from the 1970’s.
Another trend is that more companies are comfortable using cloud-based data storage systems rather than running their own systems on their own hardware. That has changed the way people interact with these systems, since some of the operational stuff is taken care of for you - back-ups, for example, are done by your cloud provider. But even if you’re just a software engineer building applications using cloud services, you still need to figure out which cloud service you would use for which task - because just like with self-hosted systems, not every service is appropriate for every kind of workload. You still need to know a bit about what is going on internally in the services so that you can figure out which service you would use in which circumstances.
I used to work in a start-up back in 2012 - we were using Heroku postgres, a cloud-hosted postgres service. And that was great - we didn’t have to do many operations ourselves. Yet pretty quickly, as our dataset grew, we realised that actually no, we did need quite a lot of insight into why a particular query was slow, why was our latency suddenly spiking at a particular time of the day...
In a cloud-hosted system you tend to not have very much insight into what's going on. If you have a small-scale application that fits easily on any type of system, then it’s not a problem. But if you’re pushing the boundaries... then even with cloud systems you still have to look “underneath the cupboards” if you want to debug why something is slow, for example.
Q: Do you think we can rely on modern databases to provide strong consistency, like linearizability?
It depends entirely on the setting that you’re running in. One example I like to use is the calendar app on your phone. That’s essentially a database containing some entries in your calendar. You probably want to sync that with your computer as well - so you’ve got copies of your calendar on your phone and on your computer.
Imagine you are in a place where you don’t have cellular data reception and you put something in your calendar. You still want to be able to update your calendar, to write to this database at any time, even if your phone is disconnected from the entire world. Which means that right after you have written a new entry to the calendar on your phone, it’s not going to be on your computer, because your phone and your computer haven’t had a chance to synchronise with each other yet.
In that case you don’t want linearizability - because if you did have linearizability, it would mean you couldn’t add anything to your calendar if your phone was disconnected from the Internet. Because otherwise you would be violating the principle of linearizability that every client sees the up-to-date version of the data.
That example just shows that there is no one model that is appropriate for all circumstances. If you’re within a single data center and we’re talking about machines within a single data center, within a single geographic region, then often achieving linearizability in that context is feasible - because the network is fast, the communication is fairly reliable (occasionally network outages happen but it’s not that common). In that case, linearizability is fine. But, clearly, in the phone app case it’s not fine.
But even if you have, say, multiple data centers across different geographic regions, then linearizability would mean that anytime you wanted to write something in one data center you’d have to coordinate with at least several of the other data centers and that would slow down every single operation that you make. Is it worth that trade-off? It depends, it depends on what you want. It might be okay for some apps, but for other apps you would say no, actually we’d rather tolerate a lower level of consistency and have it go faster - or have it be more reliable in case of network interruptions.
Q: Let’s talk about one of your main areas of research - conflict-free replicated data types, CRDTs. Can you talk a little bit about what CRDTs are and what are the problems that they are trying to solve?
We’re looking at how you build collaboration software, in quite a broad concept - anything where several devices or several users may have a copy of some data and might be contributing to that in some way. The calendar app would be one example. Another example would be something like Google Docs… Anything where there’s several people contributing to some shared file or some shared dataset.
In those types of apps usually you want each user to be able to operate on their own copy of the data on their own computer. Because if you have to do a roundtrip of the Internet for every single user interaction, it would get really slow. You might even want to let users work offline entirely when they’re disconnected from the Internet.
But if you have this sort of “several users updating things at the same time without waiting for each other” approach, that means that different users’ views of this document or this shared data are going to diverge - so they’re going to end up with inconsistent views of this data.
And what CRDTs do is provide us with a mechanism for getting everybody back in synch again, getting everyone back into a consistent state in a way that doesn’t lose any users’ inputs, but preserves all of the changes that different users have made and merges them together in a way where everyone is guaranteed to end up with a consistent view of the document at the end.
Q: In one of your talks on CRDTs you presented the auto-merge library. It had different stages of maturity along the years, can you share the state of it nowadays? Interesting production use cases?
Auto-merge is a really exciting project that’s been in the works for a couple of years now. We are just in the process of shipping a 1.0 release.
Between the 0.X release series and the 1.0 we have made a lot of changes. In particular, we’ve changed the data formats to be a lot more compact - rather than using JSON, which gets extremely verbose, we’re using a compressed binary data format for storing data on disk and for sending it to other users over the network. We’ve updated the APIs to be nicer. The network protocols were updated to be more robust and ensure that data synchronisation between different devices always works in a way that’s performant but robust. At the moment I’m doing a lot of work on performance as well.
We’ve made a lot of progress in all those areas and have shipped a preview release of 1.0. This means that although it’s not totally nailed down yet, the APIs and the data formats and the network protocols are all stable now and we don’t plan to change them.
We have also put in forward- and backward-compatibility mechanisms, meaning we can add new features in the future without breaking data formats. So you can have different users using different versions of the software and they can still interoperate - we’ve put a lot of thought into that. So, that’s been coming along very nicely and we’re going to put the finishing touches on that and ship the 1.0 release hopefully in the next month or two.
One area that still needs work is performance. At the moment, if you have large amounts of data automerge can get quite slow. And that is an area for which we have various things in progress.
One of our efforts there has been to take the javascript implementation and rewrite the core of it in Rust, which we can then compile to web assembly and still use in a web browser. But also Rust is much nicer if you want to build native iOS apps, for example, since we can just compile it to native code, wrap it with a Swift library and then we can use it there. We can also use it from Java, or Kotlin, and make Android apps and so on. So it’s really great for cross-platform functionality.
Also, of course, using Rust allows us to achieve higher performance than we can in Javascript. Part of the performance work comes from using beta data structures, which are independent of the language that you’re using.
Q: In 2018 you gave a keynote at the Kafka Summit on “Is Kafka a database?”. Given where streaming platforms are headed to, do you still see convergence between the two concepts of streaming and database?
Oh yes, definitely. They’ve converged a lot in the last few years and that’s very encouraging to see. In particular, the integration of getting data out of databases and into streaming systems, that has got a lot better with change data capture systems. That allows us to use streaming systems that augment existing databases rather than replacing databases wholesale.
And then the streaming systems have themselves gained database-like capabilities by allowing you to define stream processing queries using SQL, for example. ksqlDB or Materialize are examples of that.
I still feel there’s a lot more to be unlocked there. I gave this talk way back in 2014 or something like that, called “Turning the database inside out”. In it I speculated a bit about what it would look like if we really reorient the way we think about databases around a streaming model. And some of that has happened. But there are also a bunch of things that haven’t really been addressed yet. In particular, the way that a lot of services and business logic apps are still built - it is still very much a kind of request - response model. Meaning that if the response later changes, because some underlying data in the service changes, you don’t get notified. You don’t get told, “Hey, I’ve now got an update to the response I gave you earlier, you should now update your view onto this data as well”, no. You can poll for changes, of course, you can just keep repeating the query, but that’s pretty inefficient. And it means your delay in getting any update is your polling interval, which is probably too long to be of use for any low-latency-type use cases.
What you can do now is package up your business logic as a stream processor - and then you get this sort of low-latency-notification-type dataflow. But that’s still a very different mode from working compared to the sort of microservices-style, rest-API-style services that people are mostly building.
So I think there’s still a lot more to be done in making the services that people build respond to data changes rather than just respond to queries.
Q: What do you think about Kafka’s move to rely on the RAFT algorithm for consensus and leader election instead of Zookeeper?
For Kafka that makes perfect sense. I think the Kafka team has wanted to do that for a very long time because ZooKeeper has always been a bit of a troublesome part of the Kafka ecosystem. I will note that what is the right decision for Kafka is not necessarily the right decision for everybody else. Just because Kafka has implemented its own consensus protocol does not mean that everyone else should be giving up ZooKeeper.
Kafka is very much a special case there because what they have - the Kafka data model being an append-only log that is replicated with consistency guarantees around this log - is itself in some sense a consensus protocol that they’ve had since day one. It’s totally inherent in what Kafka is doing anyway. So putting a bit of Raft on top of that is entirely a reasonable thing for Kafka to do, since Raft also produces a log - so it’s essentially just a bit of detail around how you do the leader election, and so on.
This doesn’t mean that ZooKeeper is bad - it still makes a lot of sense for other types of systems that need to do this type of coordination which don’t have this existing log infrastructure to rely on.
Q: If you do have Kafka deployed in your production system, maybe you can use that sometimes as a leader election mechanism by itself using topics and partitions?
Yes, possibly. You do have to be quite careful with exactly what you’re doing there. The Kafka team has some very excellent distributed systems engineers and they are up to date with the research on consensus protocols, they are using some formal methods to verify the correctness of their protocols and so on. They really know what they’re doing. It’s easy to stumble into this kind of thing and if you’re not totally up to date with all the things that can go wrong, you can easily make mistakes that mean your system no longer guarantees the properties that you were hoping it would guarantee. So I’ll just caution that it’s a difficult area to get right. It’s certainly possible to get it right, but it’s not entirely straightforward.
Q: How is your impact different in the academic world, where you publish papers and teach courses, versus your role in the industry as advisor to companies and start-ups such as Confluent, if we’re talking about Kafka?
I can still keep the connections to the industry and I can still advise people. I’m not doing very much of that at the moment just because I feel like my time is better spent doing fundamental research and publishing it. But it’s nice to know that that door is still open.
I’m enjoying academia more than I thought I would - I’m very much enjoying the freedom to just work on problems that I think are important without having to worry about if it’s going to be a commercially viable product or something like that. I can just do stuff that I think is worthwhile. I can freely publish everything that I do, I don’t have to get any approvals from anyone, like from the PR department on whether I’m allowed to say something or not - I can just publish anything I like, which is wonderful.
Every bit of code that I write is open source by default, I don’t even bother setting the GitHub repo to private because, why bother, I just work open source! That's a very nice position to be in, I feel privileged to be able to do that.
There’s some tension between how much do I focus on writing code versus how much do I focus on writing papers. And those are both useful, I think. Because clearly with open source code it’s closer to being in a state where somebody can just take it and use it and build upon it.
On the other hand, I’ve had a couple of nice experiences where I wrote a paper explaining some ideas and didn’t necessarily have the time to actually implement it all. So I just wrote down my state of thinking, published it, and then other people came and picked it up and took the ideas and implemented them, potentially in ways that I wouldn’t have even thought of myself.
In that way that’s a very nice form of impact, because if I was only publishing open source code, then people would get only that one implementation. But if that implementation didn’t exactly do what they wanted, they would end up having to reimplement it and they would end up reinventing a whole bunch of stuff.
With the paper I can actually lay out precisely what the trade-offs are , what the options here are, and what the fundamental ideas of the algorithm are. Then if somebody wants to pick that up - they’re welcome to use it and can implement it in whatever way they like.
I’ve also had a couple of examples where somebody would come to me and ask ‘Hey I’ve been thinking about this problem, do you have any ideas?’. And I would go ‘Oh, it just so happens that I drafted a paper on that very topic three months ago, let me send you the pdf file’. And that’s it, I just hand it over to them and they essentially have the documentation ready that they need in order to use the ideas for themselves.
That has worked surprisingly well, actually. I found that people will take the ideas from papers that I’ve written and just run with them and do their own implementations - and that’s wonderful.
Q: You recently set up crowdfunding through Patreon, what was the motivation behind that decision?
It was mainly an experiment, really. Just to see if this is something that could work. Patreon, I guess, is best known for musicians and artists, for fans to be able to support their work in some kind of sustainable way. I was interested if this also worked for computer science research and for teaching. And so far it has worked surprisingly well.
The level of support I’m getting from Patreon is now getting close to the salary I’m getting from the University. And that comes from a research grant, which I had to apply myself to get and which only lasts for a fixed amount of time - in my case 3 years. I still have a bit over one year left on that grant, which means that my job at the University is just going to evaporate in a bit over a year’s time - if my grant runs out and I haven’t figured out some follow-on funding.
Writing grant proposals can be useful because it forces you to articulate your ideas, explain why those are important and how you’re going to realise them. But also, writing grant proposals can also be a huge time sink. So I was curious if something like Patreon would work where it is not a totally passive income stream, it does require some effort as well, but also where it does make it possible to have a sustainable form of income that then doesn’t have this cliff edge that after 3 years it suddenly ends.
Most of my Patreon supporters seem to realise that they’re buying into something long-term here, and that gives me a bigger planning horizon where I have some certainty that I can continue in doing the work I’m doing, even if I don’t manage to bring in another grant. It allows me to know that there’s a level of security here, which is a wonderful thing to have.
The question is, how well would this generalise to other people as well - and that is difficult to say because, of course, I only have one single datapoint - which is myself - to go by. I have seen a few other friends who are computer scientists who have managed to make Patreon work. In my case, of course, I have a media following - I don’t know if this would work as well if I didn’t have as many Twitter followers, if I hadn’t written a fairly high profile book...
I wish that this principle would work for other researchers as well though - because it can be a really nice way of working. People who signed up to support me, even if it’s just 5 dollars a month, are investing their own personal money in the work that I’m doing. And that means there’s a community of people who are excited about the stuff that I’m working on - a community I can now enegate with. And I know that they are interested because they are invested in it, literally.
These are the people I can bounce ideas off of, I can send them early drafts of my work and ask for feedback. Just beyond the money, that is actually a really wonderful thing - to have that sort of community of people that really care about the work you’re doing.
Q: You recently published several blog posts on the rules, explicit and implicit, governing the virtual world, ranging from content moderation to open source licenses. Can you talk a little bit about your personal philosophy in this area?
“It’s difficult”, I guess, is the one word summary.
In the early days of the Internet people had this really utopian idea - think of the 90s or so - that the Internet would transcend all national borders, that we will all be one happy community worldwide, everybody together. Think Barlow’s Declaration of the Independence of Cyberspace from the mid-nineties. That was this whole idea that national governments are irrelevant and national laws are irrelevant because everything on the Internet is worldwide. And if we look at that now, 25 years later, I think disillusionment has set in and we’ve realised that actually that utopian ideal hasn’t really been realised at all. In fact, totally to the contrary, most of the spaces we now have to communicate over the Internet are controlled by big corporations like Facebook, beholden only to their shareholders and to nobody else.
At least a government (in a democratic state) will have some legitimacy from the fact that the people have voted and elected them - whereas Facebook is not elected by anybody. The Facebook board is elected by the shareholders, those are the only people that have a say, really. And beyond that the corporation can decide to do things in whatever way it likes.
And I feel that’s a deeply unsatisfying situation to be in. But it’s not obvious how you solve that either. I believe very much in democracy and that in the end the legitimacy of any power should come from the people. But people don’t always agree. Fundamentally, in different parts of the world, people have different value systems and there’s disagreement - for instance, on things that are legal in one country are not legal in another country.
And that extends very much to what speech is allowed. Say, holocaust denial is prohibited, illegal in Germany and in a number of other European countries. But it’s not illegal in the US. Or the things that you’re allowed to say in China will not be the same as the things that you are allowed to say in Israel. Because, you know, it’s a different political system, it’s a different legal system and a different value system.
It doesn’t seem possible to have one single worldwide agreement on what you’re allowed to say or what you’re not allowed to say. At the same time, just making it free for all, allowing anyone to say anything, doesn’t seem great as well. Because if you have any platform where you’re allowed to say anything - it’s just going to get overrun by extremists who are going to use it to do a lot of horrible things.
I think there is a duty of the technologists who build social media systems to ensure that they are somehow working towards the good. And that means that certain things have to be unacceptable. And people will have to debate what’s acceptable and what’s not. My personal opinion would be that far-right extremism is not acceptable and we should find a mechanism for not allowing that, for example. Obviously, people who believe in far-right ideology will disagree and will say ‘no, no, we have to have free speech, it’s the left-wing extremists who need to be shut down’.
The only way I see of ever coming to a conclusion with this is through public debate and some kind of democratic mechanisms. Which is how people have tried to resolve their conflicts for centuries. With democracy people are never always going to be satisfied with the outcome, it’s always going to be compromises. But that’s just human nature and I think there’s no way around that.
When we thought the Internet would be this totally international space where laws don’t apply, it was forgotten that users of the Internet are still human beings. And human beings still live in countries and are still subject to the laws of the jurisdictions that they live in. Probably we’ll have to get technologists together with sociologists and with political scientists and the other people who’ve been studying the human side of this for a long time and try to join forces - because there’s not going to be a pure technology solution to this. It’s going to have to involve people and debate and messiness, just because that’s the way humans are.
Definitely seems like an important thing that we have to figure out and we really don’t have those solutions yet.
Bio:
Dr Martin Kleppmann is a researcher in distributed systems at the University of Cambridge, and author of the best-selling book “Designing Data-Intensive Applications” (O’Reilly Media). He works on local-first collaboration software, CRDTs, security protocols, and formal verification of distributed algorithms. Previously he was a software engineer and entrepreneur at Internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure.
For more engineering updates and insights:
Join our Telegram channel
Visit us on GitHub
Subscribe to our YouTube channel