Friday, March 16, 2018

Paper review. A secure sharding protocol for open blockchains.

This paper appeared in ACM CCS'16. It is authored by Loi Luu, Viswesh Narayanan, Chaodong Zheng,  Kunal Baweja, Seth Gilbert, and Prateek Saxena.

Here is a video of the conference presentation.

The problem

The Bitcoin transaction throughput does not scale. Bitcoin's PoW blockchain consumes massive computational power yet can only process up to 7 transactions per second.

The paper proposes, Elastico, a new distributed agreement protocol, based on a non-PoW Byzantine consensus protocol, for permissionless blockchains.

The challenge is that classical/non-PoW Byzantine consensus protocols do not work in an open environment:

The main idea

The key idea in Elastico is to partition the network into smaller committees, each of which processes a disjoint set of transactions (or a "shard"). The number of committees grows linearly in the total computational power of the network. Each committee has a reasonably small number of members, around 100, so they can run a classical byzantine consensus protocol to decide their agreed set of transactions in parallel.

Sharding protocols are commonly used in distributed databases and in cloud infrastructure in trusted environments. Elastico provides the first sharding protocol for permissionless blockchains tolerating a constant fraction of byzantine network nodes.

The model

The agreement property is a relaxation of the original byzantine consensus problem. The "agreement" property allows the honest processors to be in "probabilistic agreement" such that processors agree on a value with some high probability, rather than being in exact agreement.

This probabilistic part comes from step 5 of the algorithm below. There is still a proof of work component in Elastico, hence, the agreement is probabilistic.

The Elastico protocol

  1. Identity establishment and committee formation: each node uses IP, public key and PoW to locally generate an identity. 
  2. Overlay setup for committees: Nodes communicate to discover identities of other nodes in their committee. A directory committee is formed to do this more efficiently, which entails more details.
  3. Intra-committee consensus: Nodes run a standard byzantine agreement protocol, PBFT, within their assigned committees to agree on a single set of transactions.
  4. Final consensus broadcast: The final committee computes a final block from all the values received from other committees by running PBFT and broadcasts it to the network.
  5. Epoch randomness generation: the final committee runs a distributed commit-and-xor scheme to generate an exponential based but bounded set of random values. These are broadcast and used in the PoW in the next epoch.

The results

Due to its use of committees, Elastico is expected to scale transaction rates almost linearly with available computation for mining: the more the computation power in the network, the higher the number of transaction blocks selected per unit time. The scalability experiments are done on Amazon EC2 with up to 1,600 nodes and confirm the theoretical scaling properties:

"With the same network implementation as in Bitcoin, the scale up (blocks per epoch) for 100, 200, 400, 800 and 1,600 nodes with equal computational power 2 are as theoretical expectation, namely 1, 1.89, 3.61, 6.98 and 13.5 times respectively. Finally, Elastico’s clean-slate design decouples the consensus from block-data broadcasts, hence the bandwidth spent by each node remains almost constant, regardless of the size of the network. Our simulations are necessarily on a smaller scale than Bitcoin; however, if we project our results to a full deployment to a network of Bitcoin’s scale, we can expect a scale up of 10,000 in the number of agreed values per epoch. This agreement throughput is 4 orders of magnitude larger than Bitcoin's."

The related work

Recently Bitcoin-NG, Aspen, and ByzCoin are also related work. They also satisfy the same properties with Elastico in that table.

MAD questions

1. Which is more secure: a probabilistic PoW blockchain or Elastico with the "#byzantine<1/3" assumption?

This may be a false dichotomy, because even in the PoW case the byzantine nodes is assumed to be less than 1/3rd of the network to avoid selfish mining attacks. Ok, given that, let's try to analyze further. In either case the leader has limited power: it cannot invent transactions for others but can only decide on whose transactions to include or not. So only double-spending attack can be performed. The PoW does a good job of it if you wait for 6 more blocks to be added to consider a transaction to be finalized/irreversible. But for Elastico, if the "byzantine<n/3" is violated it is easier to do the double spending attack because there is no PoW and 6 blocks rule to guard the chain against it.

2. The agreement in Elastico is probabilistic, because there is still a PoW component in step 5 of the algorithm: Epoch Randomness Generation. Does that mean Elastico does not provide *instant-irreversibility* of the chain and history can rewritten? Even when the byzantine ratio is less than 1/3?

I didn't see this discussed in the paper. Elastico does not use PoW to select leader that adds a block, but rather select committees. So Elastico may indeed be providing instant-irreversibility of an added block, when byzantine<1/3.

On the other hand, maybe there is a slight chance for violating instant-reversbility. What if a different set of nodes are chosen for the committees which do not have a memory of the last block in the log? There may be a slight chance to pull this off by selecting enough number nodes to whom the last block broadcast has not reached yet. The Byzcoin paper, which I will summarize later, provides instant-irreversibility using a more cleaner approach.

3. Why do we need the committees to run PBFT? Aren't they just putting together transactions in a block? You can do that without using PBFT provided that the transactions satisfy some integrity constraints/checks, right?

I guess this is just to make sure that the block produced is the work of a group, and not by just one individual. Otherwise, a byzantine individual node masquerading the group can unduly influence the outcome. So even when byzantine<1/3, with collusion of such byzantine nodes, agreement can be violated.

4. This is more of a nitpick than a question. It looks like the paper could have provided a more clear discussion on the bound on f: byzantine nodes. At one point the paper says: "Here, 1/4 is an arbitrary constant bounded away from 1/3, selected as such to yield reasonable constant parameters."  Then later it amends: "Note that we select f = 1/4 in order to achieve a practical value of committee size. Theoretically, Elastico can work with any f less than 1/3 by increasing the committee size c accordingly to f. The 1/3 bound is because we need to run a consensus protocol (e.g., PBFT) at every committee in Step 3, which can tolerate at most 1/3 fraction of malicious committee members.)"

5. How does Aspen compare with Elastico?
Aspen also has parallel tracks/channels for processing blocks. In Aspen parallel tracks are dedicated to specific channels. In Elastico parallel track sharding is mostly for improving throughput maybe sharding with user-ids. Elastico provides improvements in faster irreversibility. On the other hand, Aspen's sharding protocol is much simpler/cleaner than Elastico's.

Wednesday, March 14, 2018

Paper review. Service-oriented sharding with Aspen

This week in our seminar, we discussed this linked whitepaper/shortpaper from Adem Efe Gencer, Robbert van Renesse, Emin Gün Sirer.

The problem

Blockchains provide trustless auditability, tamper-resistance, and transparency due to the face of Byzantine participants.  As such there is a community of users that would like to use blockchains to record in an authoritative manner transactions/provenance for many assets (in addition to the prevalent use of cryptocurrencies), such as gold, silver, diamonds, gems, land records, deeds, mortgages, boat titles, fine art, health records, and domain names.

The simplest approach address this demand is to layer these additional blockchains on top of an existing secure blockchain such as Bitcoin. The OP_RETURN opcode is adopted for this purpose, and its use has been increasing rapidly. However, this is not scalable, as it leads to a stream of costly and burdensome transactions. And Bitcoin blockchain is already overburdened and crawling.

The other approach is to create a dedicated, specialized, standalone blockchain for each asset. But that approach suffers from the lack of mining power. How do you bootstrap each blockchain? (That is not very easy problem, especially initially when the number of miners is low, how do you know Byzantine miners are in the minority? At least when piggybacking on BitCoin we are assured that the blockchain had sufficient mining power to withstand attacks.) Moreover, even if it is possible to create a separate thriving blockchain ecosystem for each asset, that would be very wasteful. Given that Bitcoin mining consumes more electricity than many countries in the world does, what would be the cost of having so many different chains? Pretty please, let's not burn down the planet.

The main idea

The proposed protocol, Aspen, securely shards the blockchain (say a popular blockchain like Bitcoin) to provide high scalability to the service users.

Sharding is a well-established technique to improve scalability by distributing contents of a database across nodes in a network. But sharding blockchains is non-trivial. The main challenge is to preserve the trustless nature while hiding transactions in other services of a blockchain from users participating in one service.

"Aspen employs a sharding approach that comes with the following benefits: (1) preserves the total computational power of miners to secure the whole blockchain, (2) prevents users from double-spending their funds while maintaining the same trustless setup assumptions as Bitcoin, (3) improves scalability by absolving non-miner participants --i.e. service users-- from the responsibility of storing, processing, and propagating irrelevant data to confirm the validity of services they are interested in."

The paper would benefit a lot from clarifying this point further. There are two different entities in the blockchain: the service users and the miners. The sharding in Aspen helps reduce the workload on the users, because the users can just focus the transactions on the service they are using. On the other hand, the miners need to know about the history of all the services/channels as well as their service and integration rules. The miners are the nodes that generate the keyblocks and publish service transactions as microblocks as described below.

The architecture

Aspen is an instantiation of service-oriented sharding on Bitcoin-NG. Please go read my short summary of the Bitcoin-NG paper, if you are unfamiliar with it. It is a great paper that prescribes a way to scale the throughput of a PoW blockchain.

Each channel corresponds to a service. Each channel contains 1) the global genesis block (the exact same special block for each channel) 2) all key blocks (the exact same set of blocks for all channels) and 3) the set of microblocks containing custom transactions exclusive to that service.

In other words, Aspen distributes transactions to microblocks with respect to services. The key blocks keep microblock chains together; the key blocks are not per-service, they are the same (i.e. common, shared) for all services.

If you like to look at this from a service-oriented perspective, there are two protocols for each channel/service:
1) A service protocol defines the validity of transactions in a given channel, including the syntax for each transaction type, the relationship between transactions within a channel, the size, frequency, and format constraints for blocks that keep transactions.
2) An integration protocol specifies the security, incentive mechanism, valid service numbers, the genesis block, and the inter-channel communication process between the payment channel and the other channels.

Each channel maintains key blocks to enforce the integration protocol. And each channel maintains the microblocks only to contain the service-specific transactions. Two special channels, payment (to exchange funds) and registration (to add/update services), are defined by Aspen to help bootstrap the blockchain. We discuss these two next.

Flow of funds

To prevent double spending, Aspen uses the payment channel to make each fund spendable only in a specific channel. A special payment channel transaction, funding pore, enables users to lock funds to other channels. Transfers across channels are allowed only in one way, from the payment channel to others.

Alternatively, users can directly buy locked funds at the target channel to pay for the service running on the corresponding channel. Aspen enforces a high minimum fee for serializing funding pores to (1) discourage users from bloating the payment channel and (2) improve the fungibility of funds in nonpayment channels.

As shown in Figure 3.b, a coinbase transaction in a key block provides separate outputs to compensate the current and the previous miner for each service they provision.

Service integration

Users propose protocols to introduce or update services by generating transactions for the registration channel. A service protocol is specified in a platform independent language such as WebAssembly or Lua. The miners conduct an election to choose a registration channel transaction and indicate their choice using ballots. If a particular transaction is referred by more than a certain fraction of ballots, its protocols become active.

The voting process works similar to that in Bitcoin.

MAD questions

1. It may be too much of a burden to require that the miners store/process the entire history of all services/channels, and could create a scalability barrier for the blockchain. Checkpoints (a.k.a. snapshots) can help for this as they make the state more compact: you don't need to go all the way to the origin block, instead you can refer to a recent checkpoint to validate transactions. The paper mentions that the key blocks include checkpoints, but it is unclear if these checkpoints are snapshots at all.

What could go wrong if we included checkpoints as snapshots? An obvious strawman attack is where the leader is proposing an incorrect checkpoint to misrepresent information. I guess the other miners can notice this and add a poison transaction.

I don't know if there are other measures need to be taken to incorporate checkpoints. On a philosophical point, does using checkpoints violate the transparency of blockchain? When the whole log history is available, the nodes have more information to go with. Starting from a recent checkpoint adds some obscurity to the history.

2. It may not be necessary to link all the services together with each key block. So it may be possible to relax Aspen to have specialized miners that only track certain services; when those miners mine a keyblock they would only link the corresponding tracked services together. Statistically, each service will get a miner that tracks it soon and get a key block that integrates it to other services.

I am not sure if this relaxation would lead to security problems. I guess one problem could be incompatibilities between to consequent miners; the rewarding/incentive mechanism may need to be rethought for miners of different kinds trailing each other. Another problem could be that some less popular services may attract less number of miners and high latency transactions. On the other hand, if there is way to make this work, this relaxation can help alleviate the scalability problem for the miners as the number of services grow.

Wednesday, February 21, 2018

Paper summary: Bitcoin-NG -- A scalable Blockchain Protocol

This week in our seminar, we discussed the Bitcoin-NG paper. The paper appeared in NSDI 16, and is authored by Ittay Eyal, Adem Efe Gencer, Emin Gün Sirer, and Robbert van Renesse at the Cornell University.

The model section of this paper is very well formalized and written. This is like a Rosetta Stone find for classical distributed systems researchers that want to enter blockchain research. So in this summary, I will start by covering as much of that as I can.

The Nakamoto Consensus Problem 

The blockchain system is comprised of a set of nodes N connected by a reliable peer-to-peer network. Nodes can generate public-private key-pairs for themselves. The system employs a cryptopuzzle system, defined by a cryptographic hash function H. The solution to a puzzle defined by the string y is a string x such that H(y|x) --the hash of the concatenation of the two-- is smaller than some target (i.e., the hash has k number of leading zeros). Each node i has a limited amount of compute power, called mining power, measured by the number of potential puzzle solutions it can try per second. A solution to a puzzle constitutes a "proof of work", as it statistically indicates the amount of work a node had to perform in order to find it.

At any time t, a subset of nodes B(t) are Byzantine and can behave arbitrarily, controlled by a single adversary. The mining power of the Byzantine nodes is less than 1/4 of the total compute power at any given time.

Comparing the Nakamoto consensus problem with the classic Byzantine consensus problem is very instructional. Nakamoto consensus has probabilistic guarantees for Termination, Agreement, and Validity, whereas the classic Byzantine Consensus has deterministic guarantees for them.

BitCoin's Blockchain protocol

Bitcoin provides a Byzantine fault tolerant blockchain protocol for implementing a decentralized cryptocurrency. Each user commands addresses, and sends Bitcoins by forming a transaction from her address to another's address and sending it to the Bitcoin mining nodes.  Transactions are protected with cryptographic techniques that ensure only the rightful owner of a Bitcoin address can transfer funds from it. A client owns x Bitcoins at time t if the aggregate of unspent outputs to its address is x. Miners accept transactions only if their sources have not been spent, preventing users from double-spending their funds. The miners commit the transactions into a global append-only log called the blockchain. If multiple miners create blocks with the same preceding block, the chain is forked into branches, forming a tree. All miners add blocks to the heaviest chain of which they know, with random tie-breaking.

If you want to understand the blockchain data structure, this youtube video and demo is fantastic for it.

Unfortunately Bitcoin is haunted with scalability woes. The maximum rate at which Bitcoin can process transactions is capped by the block size and block interval.

Increasing the block size (which is currently set at 1MB) improves throughput, but the resulting bigger blocks take longer to propagate in the network. Of course increasing that to 10 MB will not cause a significant problem in propagation, after all bandwidth is not that limited. However, taken at extreme the tradeoff will be there. Moreover, every second of headstart counts for mining the next block: New blocks received late means miners are wasting resources by building on an old block that is no longer the most recent.

Reducing the block interval reduces latency, but leads to instability due to frequent forks. Bitcoin currently targets a conservative 10 minutes between blocks, yielding 10-minute expected latencies for transactions to be encoded in the blockchain.

With this setup, Bitcoin yields a wimpy 1 to 3.5 transactions per second. Transaction finalization is also problematic. If you wait for 6 blocks to append to declare a transaction to be finalized, that will take an expected 60 minutes time.


Bitcoin-NG is a protocol that improves on Bitcoin in terms of transaction throughput and latency of propagation. Bitcoin-NG’s latency is limited only by the propagation delay of the network, and its bandwidth is limited only by the processing capacity of the individual nodes. Bitcoin-NG achieves this performance improvement by decoupling Bitcoin’s blockchain operation into two planes: leader election and transaction serialization. In Bitcoin-NG, consensus is pushed back only to identify the leader, and serialization is performed by the leader. This seems to me to be a classic application of chain replication (while the paper does not cite chain replication).

Bitcoin-NG divides time into epochs, where each epoch has a single leader. As in Bitcoin, leader election is performed randomly and infrequently via proof-of-work. Once a leader is chosen, it is entitled to serialize transactions via microblocks unilaterally until a new leader is chosen, marking the end of the former’s epoch.

Thus the protocol introduces two types of blocks: key blocks for leader election and microblocks that contain the ledger entries. Like a Bitcoin block, a key block contains the reference to the previous block (either a key block or a microblock, usually the latter), the current Unix time, a coinbase transaction to pay out the reward, a target value, and a nonce field containing arbitrary bits. Unlike Bitcoin, a key block in Bitcoin-NG contains a public key that will be used in subsequent microblocks.

Once a node generates a key block it becomes the leader. As a leader, the node is allowed to generate microblocks at a set rate smaller than a predefined maximum. (Specifically, if the timestamp of a microblock is in the future, or if its difference with its predecessor's timestamp is smaller than the minimum, then the microblock is invalid. This prohibits a greedy leader from swamping the system with microblocks.)  A microblock contains ledger entries and a header. The header contains the reference to the previous block, the current Unix time, a cryptographic hash of its ledger entries, and a cryptographic signature of the header. The signature uses the private key that matches the public key in the latest key block in the chain.

Microblock fork prevention is essential for this system, since the leader can spit out many microblocks quickly to make more money from transaction fees.  To report on a leader that violates the microblock production rate, a poison transaction is employed. The poison transaction contains the header of the first block in the pruned branch as a proof of fraud, and it has to be placed on the blockchain within the maturity window of the misbehaving leader’s key block, and before the revenue is spent by the malicious leader.

The new leader cannot fake an offending microblock to accuse the old leader, because the poison transaction should contain the private signature of the previous leader. Moreover the 40/60 partitioning of credit for microblock transactions between the old leader and the new leader incentivizes the new leader not to cut the microblock set short, because it would like to get more revenue from them. So the new leader that mines the last key block has all the incentive to behave according to the protocol. The 40/60 partitioning of the credit is shown to satisfy these constraints via mathematical modeling in the paper.

It looks like the microblocks need not be chained in a sequence. Rather the microblocks form a set that follow the keyblock of the corresponding leader and satisfying the rate limitations prescribed in the protocol. So what is the authoritative microblock set taken then, given that slightly different set of microblocks may arrive at different nodes of the network. It looks like the new leader (that mines the next key block) is the one to authoritatively decide on that.

MAD questions

1) In the Bitcoin protocol, if we increase the block size what could go wrong? That would definitely help with the throughput problems Bitcoin is facing. It would mean some slight increase in delivery latency depending on the bandwidth available in the path. But even increasing it to 10MB is unlikely to break anything, I think. (But hey I am a newbie in this space.) So why are there endless debates about this in the Bitcoin community? Could it be that the small blockers are concerned more about actually the increase in the transaction volume, which would mean the history grows quickly, which would make being a full Bitcoin node (rather than just a client) become more of a burden? But isn't that just delaying the inevitable? Is the idea here to delay the inevitable until more storage space and computing become available? That sounds so wrong to me as an engineer.

2) It looks like a hard fork will be needed for switching to the Bitcoin-NG protocol, because the current Bitcoin clients are unaware of the concept of microblocks. Oh, well, good luck with getting on a "consensus" on a hard fork in Bitcoin protocol. In his latest blog post "Why Decentralization Matters", Chris Dixon cited as the biggest problem with the centralized services that they can change the rules on the users: "The good news is that billions of people got access to amazing technologies, many of which were free to use. The bad news is that it became much harder for startups, creators, and other groups to grow their internet presence without worrying about centralized platforms changing the rules on them, taking away their audiences and profits. This in turn stifled innovation, making the internet less interesting and dynamic."

On the other hand, the "community-governed" decentralized protocols seem to be haunted by the reverse problem: It is very hard to change the rules even to fix problems with the protocol. Isn't that as big, if not a bigger, problem as the former for stifling innovation?

Tuesday, February 13, 2018

Paper review. IPFS: Content addressed, versioned, P2P file system

This week we discussed the IPFS whitepaper by Juan Benet in my Distributed Systems Seminar.

Remember peer-to-peer systems? IPFS is "peer-to-peer systems reloaded" with improved features. IPFS is a content-addressed distributed file system that combines Kademlia + BitTorrent + Git ideas. IPFS also offers better privacy/security features: it provides cryptographic hash content addressing, file integrity and versioning, and filesystem-level encryption and signing support.

The question is will it stick? I think it won't stick, but this work will still be very useful because we will transfer the best bits of IPFS to our datacenter computing as we did with other peer-to-peer systems technology. The reason I think it won't stick has nothing to do with the IPFS development/technology, but has everything to do with the advantages of centralized coordination and the problems surrounding decentralization. I rant more about this later in this post. Read on for the more detailed review on IPFS components, killer app for IPFS, and MAD questions.

IPFS components


Nodes are identified by a NodeId, the cryptographic hash3 of a public-key, created with S/Kademlia’s static crypto puzzle. Nodes store their public and private keys (encrypted with a passphrase).


Transport: IPFS can use any transport protocol, and is best suited for WebRTC DataChannels(for browser connectivity) or uTP.
Reliability: IPFS can provide reliability if underlying networks do not provide it, using uTP or SCTP.
Connectivity: IPFS also uses the ICE NAT traversal techniques.
Integrity: IPFS optionally checks integrity of messages using a hash checksum.
Authenticity: IPFS optionally checks authenticity of messages using HMAC with sender’s public key.


To find other peers and objects, IPFS uses a DSHT based on S/Kademlia and Coral. Coral DSHT improves over by Kademlia based on the three rules of real-estate: location, location, location. Coral stores addresses to peers who can provide the data blocks taking advantage of data locality. Coral can distribute only subsets of the values to the nearest nodes avoiding hot-spots. Coral organizes a hierarchy of separate DSHTs called clusters depending on region and size. This enables nodes to query peers in their region first, "finding nearby data without querying distant nodes" and greatly reducing the latency of lookups.


In IPFS, data distribution happens by exchanging blocks with peers using a BitTorrent inspired protocol: BitSwap. Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent. The blocks can come from completely unrelated files in the filesystem. In a sense, nodes come together to barter in the marketplace. BitSwap incentivizes nodes to seed/serve blocks even when they do not need anything in particular. To avoid leeches (freeloading nodes that never share), peers track their balance (in bytes verified) with other nodes, and peers send blocks to debtor peers according to a function that falls as debt increases. For bartering, potentially, a virtual currency like FileCoin (again by Juan Benet) can be used.


IPFS builds a Merkle DAG, a directed acyclic graph where links between objects are cryptographic hashes of the targets embedded in the sources. (This video explains Merkle Trees superbly.) Merkle DAGs provide IPFS many useful properties:
1. Content addressing: All content is uniquely identified by its multihash checksum.
2. Tamper resistance: all content is verified with its checksum.
3. Deduplication: all objects that hold the exact same content are equal, and only stored once.


IPFS also defines a set of objects for modeling a versioned filesystem on top of the Merkle DAG. This object model is similar to Git’s:
1. block: a variable-size block of data.
2. list: an ordered collection of blocks or other lists.
3. tree: a collection of blocks, lists, or other trees.
4. commit: a snapshot in the version history of a tree.


IPNS is the DNS for IPFS. We have seen that NodeId is obtained by hash(node.PubKey). Then IPNS assigns every user a mutable namespace at: /ipns/<NodeId>. A user can publish an Object to this /ipns/<NodeId> path signed by her private key. When other users retrieve the object, they can check the signature matches the public key and NodeId. This verifies the authenticity of the Object published by the user, achieving mutable state retrieval.

Unfortunately since <NodeId> is a hash, it is not human friendly to pronounce and recall. For this DNS TXT IPNS Records are employed. If /ipns/<domain> is a valid domain name, IPFS looks up key ipns in its DNS TXT records: TXT "ipfs=XLF2ipQ4jD3U ..." 
# the above DNS TXT record behaves as symlink
ln -s /ipns/XLF2ipQ4jD3U /ipns/

There is even the Beaker browser to help you surf IPFS. But its usability is not great.  If IPFS wants to manage the web, it should further improve its IPNS and content discovery game. Where is the search engine for IPFS content? Do we need to rely on links from friends like the 1993's Web?

What is the killer app for IPFS?

The introduction of the paper discusses HTTP and Web, and then says:
"Industry has gotten away with using HTTP this long because moving small files around is relatively cheap, even for small organizations with lots of traffic. But we are entering a new era of data distribution with new challenges: (a) hosting and distributing petabyte datasets, (b) computing on large data across organizations, (c) high-volume high-definition on-demand or real-time media streams, (d) versioning and linking of massive datasets, (e) preventing accidental disappearance of important files, and more. Many of these can be boiled down to "lots of data, accessible everywhere." Pressed by critical features and bandwidth concerns, we have already given up HTTP for different data distribution protocols. The next step is making them part of the Web itself. 
What remains to be explored is how [Merkle DAG] data structure can influence the design of high-throughput oriented file systems, and how it might upgrade the Web itself. This paper introduces IPFS, a novel peer-to-peer version-controlled filesystem seeking to reconcile these issues."
How common are petabyte or even gigabyte files on the Internet? There is definitely an increase in size trend due to the popularity of the multimedia files. But when will this become a pressing issue? It is not a pressing issue right now because CDNs help a lot for reducing traffic for the Internet. Also bandwidth is relatively easy to add compared to latency improvements. Going for a decentralized model globally comes with several issues/headaches, and I don't know how bad the bandwidth problems would need to get before starting to consider that option. And it is not even clear that the peer-to-peer model would provide more bandwidth savings than CDNs at the edge model.

I am not convinced that the Web is the killer application for IPFS, although at the end, the paper gets ambitious:
"IPFS is an ambitious vision of new decentralized Internet infrastructure, upon which many different kinds of applications can be built. At the bare minimum, it can be used as a global, mounted, versioned filesystem and namespace, or as the next generation file sharing system. At its best, it could push the web to new horizons, where publishing valuable information does not impose hosting it on the publisher but upon those interested, where users can trust the content they receive without trusting the peers they receive it from, and where old but important files do not go missing. IPFS looks forward to bringing us toward the Permanent Web."
Decentralization opens a Pandora's box of issues. Centralized is efficient and effective. Coordination wants to be centralized. A common and overhyped misconception is not centralized is not scalable and centralized is a single point of failure. After close to two decades of work in cluster computing and cloud computing, we have good techniques in place for achieving scalability and fault-tolerance for centralized (or logically centralized, if you like) systems. For scalability, shard it, georeplicate it, and provide CDNs for reading. For fault-tolerance, slap Paxos on it, or use chain replication systems (where Paxos guards the chain configuration), or use the globe-spanning distributed datastores available today. Case in point, Dropbox is logically-centralized but is very highly available and fault-tolerant, while serving to millions of users. Facebook is able to serve billions of users.

If you want to make the natural disaster tolerance argument to motivate the use of IPFS, good luck trying to use IPFS over landlines when power and ISPs are down, and good luck trying to form a multihop wireless ad hoc network over laptops using IPFS. Our only hope in a big natural disaster is cell towers and satellite communication. Disaster tolerance is serious work and I hope governments around the world are funding sufficient research into operational, planning, and communications aspects of that.

In Section 3.8, the whitepaper talks about the use cases for IPFS:
1. As a mounted global filesystem, under /ipfs and /ipns.
2. As a mounted personal sync folder that automatically versions, publishes, and backs up any writes.
3. As an encrypted file or data sharing system.
4. As a versioned package manager for all software.
5. As the root filesystem of a Virtual Machine.
6. As the boot filesystem of a VM (under a hypervisor).
7. As a database: applications can write directly to the Merkle DAG data model and get all the versioning, caching, and distribution IPFS provides.
8. As a linked (and encrypted) communications platform.
9. As an integrity checked CDN for large files (without SSL).
10. As an encrypted CDN.
11. On webpages, as a web CDN.
12. As a new Permanent Web where links do not die.
I don't think any of these warrant going full peer-to-peer. There are centralized solutions for them, or centralized solutions are possible for them.

An important use case for IPFS is to circumvent government censorship. But isn't it easier to use VPNs then to use IPFS for this purpose. (Opera browser comes with VPN build-in, and many easy to use VPN apps are available.) If the argument is that the governments can ban VPNs or prosecute people using VPN software, those issues also apply to IPFS unfortunately. Technology is not always the solution especially when dealing with big social issues.

IPFS may be a way of sticking it to the man. But the invisible hand of the free market forces also help here; when one big corporation starts playing foul and upsets the users, new companies and startups quickly move in to disrupt the space and fill in the void.

Again, I don't want to come across wrong. I think IPFS is great work, and Juan Benet and IPFS contributors accomplished a gigantic task, with a lot of impact on future systems (I believe the good parts of IPFS will be "adopted" to improve Web and datacenter/cloud computing). I just don't believe dialing the crank to 11 on decentralization is the right strategy for wide adoption. I don't see the killer application that makes it worthwhile to move away from the convenience of the more centralized model to open the Pandora's box with a fully-decentralized model.

MAD questions 

1) Today's networking ecosystem evolved for the client-server model, what kind of problems could this create for switching to peer-to-peer model? As a basic example, the uplink at residential (or even commercial) spaces is an order of magnitude less than downlink assuming they are consumers of traffic not originators of traffic. Secondly, ISPs (for good or bad) evolved to take on traffic shaping/engineering responsibilities peering with other ISPs. It is a complex system. How does popular IPFS use interact with that ecosystem.

2) As a related point, smartphones gained primary citizenship status in today's Internet. How well can peer-to-peer and IPFS get along with smartphones? Smartphones are very suitable to be thin clients in the cloud computing model, but they are not suitable to act as peers in a peer-to-peer system (both for battery and connection bandwidth reasons). To use a technical term, the smartphones will be leeches in a peer-to-peer model. (Well unless there is good token/credit system in place, but it is unrealistic to expect that soon.)

3) On the academic side of things, designing a decentralized search engine for IPFS sounds like a great research problem. Google had it easy in the datacenter but can you design a decentralized keyword/content based search engine (or one day old indexes) maintained in a P2P manner over IPFS nodes? Popularity of a file in the system (how many copies it has in the system) can play a role in its relevance ranking for the keyword. Also could a bloom filter like data structure be useful in a p2p search?

4) Here are some more pesky problems with decentralization. I am not clear if satisfactory answers exist on these. Does IPFS mean I may be storing some illegal content originated by other users?
How does IPFS deal with the volatility? Just closing laptops at night may cause unavailability under an unfortunate sequence of events. What is the appropriate number of replicas for a data to avoid this fate? Would we have to over-replicate to be conservative and provide availability?
If IPFS is commonly deployed, how do we charge big content providers that benefit from their content going viral over the network? Every peer chips in distributing that content, but the content generator benefits let's say by way of sales. Would there need to be a token economy that is all seeing and all fair to solve this issue?

5) Is it possible to use Reed-Solomon erasure coding with IPFS? Reed-Solomon codes are very popular in the datacenters as they provide great savings for replication.

6) IPFS does not tolerate Byzantine behavior, right? The crypto puzzle needed for Node Id can help reduce the false spammers, as it makes them do some work. But after joining, there is no guarantee that the peers will play it fair: they can be Byzantine to wreak havoc on the system. But how much problems can they cause? Using cryptos and signatures prevent many problems. But can the Byzantine nodes somehow collude to cause data loss in the system, making the originator think the data is replicated, but then deleting this data? What other things can go wrong?

Saturday, February 10, 2018

Paper summary. SnailTrail: Generalizing critical paths for online analysis of distributed dataflows

Monitoring is very important for distributed systems, and I wish it would receive more attention in research conferences. There has been work on monitoring for predicate detection purposes and for performance problem detection purposes. As machine learning and big data processing frameworks are seeing more action, we have been seeing more work on the latter category. For example in ML there have been work on how to figure out what is the best configuration to run. And in the context of general big data processing framework there has been work on identifying performance bottlenecks.

Collecting information and creating statistics about a framework to identify the bottleneck activities seems like an easy affair. However, the "making sense of performance" paper (2015) showed that this is not as simple as it seems, and sophisticated techniques such as blocked time analysis are needed to get a more accurate picture of performance bottlenecks.

This paper (by ETH Zurich and due to appear in NSDI 18) extends the performance bottleneck analysis to more general frameworks and to be supported by an online monitoring tool called SnailTrail. SnailTrail is written in Rust, over the Timely Dataflow framework. It supports monitoring of applications written in Flink, Spark, Timely Dataflow, Tensorflow, and Heron.

SnailTrail overview

The SnailTrail tool operates in 5 stages:
  • it ingests the streaming logs from the monitored distributed application,
  • slices those streams into windows, 
  • constructs a program activity graph (PAG) for the windows, 
  • computes the critical path analysis of the windows, and 
  • outputs the summaries. 

Program activity graph (PAG) is a directed acyclic graph. The vertices denote the start and end of activities, such as: data processing, scheduling, buffer management, serialization, waiting, application data exchange, control messages, or unknown activities. The edges has a type and a weight for the activities. The edges capture the happened-before relationships between the vertices. Figure 2.a shows an example. The figure also shows how to project this PAG to an interval, which is pretty straightforward and as you expect.

As output, SnailTrail provides summaries for operation, worker, and communication bottlenecks. These summaries help identify which machine is overloaded, and which operators should be re-scaled. The figure below shows examples.

Critical path analysis

Critical path analysis (CPA) is a technique originally introduced in the context of project planning. CPA produces a metric which captures the importance of each activity in the transient critical paths of the computation.
The CPA score calculation in SnailTrail centers on the Transient Path Centrality concept which corresponds to the number of paths this activity appears on. In the formula e[w] corresponds to the weight of the edge e, which is taken as the start/end time duration of the activity.

In Figure 3, the activity (u,v) has the highest CPA score because it is involved in all of the 9 paths that appear in this window.

A noteworthy thing in the PAG in Figure 3 is the wait period (denoted as dashed lines) after b in worker 1. Because worker 1 is blocked with a wait after b, that path terminates at b, and does not feed into another path. The paper explains this as follows: "Waiting in our model is always a consequence of other, concurrent, activities, and so is a key element of critical path analysis: a worker does not produce anything useful while waiting, and so waiting activities can never be on the critical path." Because the two wait states terminates in deadend some of the paths, in this interval depicted in Figure 3, there are only 9 possible paths.

The formula above enables us to calculate the CPA scores without enumerating and materializing all the transient critical paths of the computation in each window. But how do you calculate N (which, for Figure 3 is calculated as 9) without enumerating the critical paths? It is simple really. The PAG is a directed acyclic graph (DAG). Everytime there is a split in the path, you copy path_count to both edges. Everytime two edges join, you set the new path_count to be the sum of the path_counts in the two incoming edges. At the end of the interval, look at how many paths are exiting, that is N. Give this a try for Figure 3 yourself.

MAD questions

1) What could be some drawbacks/shortcomings of PCA? One reason PCA may not be representative is because one execution of the application may not be typical of all executions. For example in Fig 3, w3 may take long time in the next execution in c-d activity and that could become the bottleneck in another execution of the application. But it is possible to argue that since SnailTrail produces summaries, this may not be a big issue. Another reason PCA may not be representative is because the execution may be data dependent. And again it is possible to argue that this won't be a big issue if the application uses several data in processing, and things get averaged.

Another criticism of CPA could be that it does not compose. Try combining two adjacent windows; that would lead to changing the CPA scores for the activities. Some previously low scored  activities will jump up to be high, and some highly scored activities will go down. Because of this non-composition, it becomes important to decide what is the best window/interval size to calculate these CPA metrics for summarization purposes. On the other hand, it may be reasonable to expect these metrics to be non-composable, since these metrics (blocked time analysis and critical time analysis) are designed to be complex to capture inherent/deep critical activities in the computations.

Could there be another approach than the CPA scoring presented here to capture the importance of an execution activity in the paths of the computation?  Maybe something that uses the semantics of activity types and their relation to each other?

2) The SnailTrail method uses snapshots to determine the start and end of the windows that the CPA scoring algorithm works on. Does time synchronization need to be perfect for snailtrail snapshot and analysis to work? What are the requirements on the time synchronization for this to work? 

It turns out this currently requires perfect clock synchronization. The evaluation experiments are run in the same machine with 32 cores. Without perfect clock synchronization, the snapshots may not be consistent and that could break the path calculation and CPA scoring. Hybrid Logical Clocks can help deal with the uncertainty periods in NTP clocks and can make the method work. The paper addresses this issue in Appendix F: "In SnailTrail, we assume that the trend toward strong clock synchronization in datacenters means that clock skew is not, in practice, a significant problem for our analysis. If it were to become an issue, we would have to consider adding Lamport clocks and other mechanisms for detecting and correcting for clock skew."

3) The paper doesn't have an example of improvement/optimization based on summary analytics. Even when the summary shows the high scored CPA activities to improve upon, it will be tricky for a developer to go in and change the code to improve things, because there is no indication of how easy it would be to improve the performance on this high CPA score activities. 

One way to address this could be to extend the CPA idea as follows. After ranking the activities based on CPA scores, construct another ranking of activities based on ease of optimizing/improving (I don't know how to do this, but some heuristics and domain knowledge can help). Then start the improvements/optimizations from the easiest to optimize activities that have highest CPAs.
Another way to make the summary analytics more useful is to process it further to provide more easily actionable suggestions, such as add more RAM, more CPU, more network/IO bandwidth. Or the suggestions could also say that you can do with less of resources of one kind, which can helpf for saving money in cloud deployments. If you can run with similar performance but on cheaper configurations, you would like to take that option. Would this extension require adopting SnailTrail monitoring and CPA scoring more towards capturing  resource usage related metrics? 

4) Since CPA first appeared in the project/resource management domain, could there be other techniques there that can apply to performance monitoring of distributed execution?

5) I think SnailTrail breaks new ground in the "context" or "philosophy" of monitoring: it is not trace based, it is not snapshot based, but it is window/interval-based. In our work on Retroscope, we argued it is important to aggregate the paths for both spatially (across the distributed computation) and temporally (as an evolving picture of the system's performance). SnailTrail extends the snapshot view to window views.


Frank McSherry recently joined ETH Zurich's systems group and will be helping with the strymon project and the SnailTrail work, so I'll be looking forward to more monitoring work from there.  

Wednesday, February 7, 2018

Think before you code

I have been reading Flash boys by Michael Lewis. It is a fascinating book about Wall Street and high-frequency traders. I will write a review about the book later but I can't refrain from an overview comment. Apart from the greediness, malice, and sneakiness of Wall Street people, another thing that stood out about them was the pervasive cluelessness and ineptness. It turns out nobody knew what they were doing, including the people that are supposed to be regulating things, even sometimes the people that are gaming the system. This brought to my mind Steve Jobs' quote from his 1994 interview: "Everything in this world... was created by people no smarter than you."

Anyways, today I will talk about something completely different in the book that caught my eye. It is about thinking before coding.

On Page 132 of the book:
Russians had a reputation for being at the best programmers on Wall Street, and Serge thought he knew why: They had been forced to learn to program computers without the luxury of endless computer time. Many years later, when he had plenty of computer time, Serge still wrote out new programs on paper before typing them into the machine. "In Russia, time on the computer was measured in minutes," he said. "When you write a program, you are given a tiny time slot to make it work. Consequently we learned to write the code in ways that minimized the amount of debugging. And so you had to think about it a lot before you committed it to paper.... The ready availability of computer time creates this mode of working where you just have an idea and type it and maybe erase it ten times. Good Russian programmers, they tend to have had that one experience at some time in the past--the experience of limited access to computer time."

Of course, Dijkstra was a big proponent of thinking before programming. Here in EWD 1284, he compares European CS vs American CS. He didn't have nice things to say about the keyboard-happy American CS.
The major differences between European and American CS are that American CS is more machine-oriented, less mathematical, more closely linked to application areas, more quantitative and more willing to absorb industrial products in its curriculum. For most of these differences there are perfect historical explanations, many of which reflect the general cultural differences between the two continents, but for CS we have also to take into account the special circumstance that due to the post-war situation, American CS emerged a decade earlier, for instance at the time when design, production, maintenance and reliability of the hardware were still causes for major concern. The names of the early professional societies are in this respect revealing: the “Association for Computing Machinery” and the “British Computer Society”. And so are the names of the scientific discipline and the academic departments: in the US, CS is short for Computer Science, in Europe it is short for Computing Science.
The other circumstance responsible for a transatlantic difference in how CS evolved I consider a true accident of history, viz. that for some reason IBM was very slow in getting interested in Europe as a potential market for its computers, and by the time it targeted Europe, this was no longer virgin territory. Consequently, IBM became in Europe never as predominant as it has been in Northern America. 

Here in EWD1165, Dijkstra shares an anecdote about thinking before coding and uses that as an opportunity to take a shot at "software engineering". Man, Dijkstra had the best rants.
A recent CS graduate got her first job, started in earnest on a Monday morning and was given her first programming assignment. She took pencil and paper and started to analyse the problem, thereby horrifying her manager 1.5 hours later because “she was not programming yet”. She told him she had been taught to think first. Grudgingly the manager gave her thinking permission for two days, warning her that on Wednesday she would have to work at her keyboard “like the others”! I am not making this up. And also the programming manager has found the euphemism with which to lend an air of respectability to what he does: “software engineering”.

In fact, Dijkstra takes things further, and advocates even forgoing the pen and the paper when thinking:
What is the shortest way to travel from Rotterdam to Groningen, in general: from given city to given city. It is the algorithm for the shortest path, which I designed in about twenty minutes. One morning I was shopping in Amsterdam with my young fiancée, and tired, we sat down on the café terrace to drink a cup of coffee and I was just thinking about whether I could do this, and I then designed the algorithm for the shortest path. As I said, it was a twenty-minute invention. In fact, it was published in ’59, three years late. The publication is still readable, it is, in fact, quite nice. One of the reasons that it is so nice was that I designed it without pencil and paper. I learned later that one of the advantages of designing without pencil and paper is that you are almost forced to avoid all avoidable complexities. Eventually that algorithm became, to my great amazement, one of the cornerstones of my fame.
Dijkstra (2001), in an interview with Philip L. Frana. (OH 330; Communications of the ACM 53(8):41–47)"

In several of his EWDs, Dijkstra mentioned how he favored solving problems without pen and paper and just by thinking hard, and how he has the entire article sketched in his mind before he sits to write it down in one go. Here is an example where he mentions the Mozart versus Beethoven approach to composing. Obviously Dijkstra was on the Mozart camp.
There are very different programming styles. I tend to see them as Mozart versus Beethoven. When Mozart started to write, the composition was finished. He wrote the manuscript and it was 'aus einem Guss' (from one cast). In beautiful handwriting, too. Beethoven was a doubter and a struggler who started writing before he finished the composition and then glued corrections onto the page. In one place he did this nine times. When they peeled them, the last version proved identical to the first one.
Dijkstra (2001) Source: Denken als discipline, a program from Dutch public TV broadcaster VPRO from April 10th, 2001 about Dijkstra"

For the record, and not that you care, I am on team Beethoven. Being perfectionist and taking an "I will get it right in one shot" approach to thinking/designing makes me freeze. Separating concerns by dividing my thinking between a creative mode and criticizing mode works much better for me. (I talked about that in here, here, and here.)

Maybe my thinking is sloppy and I need crutches. But it is possible to argue that writing/prototyping is a tool--not a crutch-- for thinking, and that tools are invaluable capability multipliers. And on the power of writing as a tool, I will end with these quotes:
Writing is nature's way of letting you know how sloppy your thinking is. -- Guindon. 
If you think without writing, you only think you're thinking. -- Leslie Lamport 
Without writing, you are reduced to a finite automaton.
With writing you have the extraordinary power of a Turing machine. -- Manuel Blum

MAD questions

1) Hmm, a handicap that becomes an advantage. I think that was the entire theme of the "David and Goliath" book by Gladwell. Of course, that book got a largely negative critical response. But that doesn't mean it cannot be a useful model sometimes.

Are there scientifically proven studies of a seeming handicap becoming an advantage? What are some examples?

2) Yeah, about that whole Mozart versus Beethoven mode thing... Am I mistaken? Should I be more open-minded about the Mozart mode? What is doable by the Mozart mode that would be impossible to do with the Beethoven mode?

3) Well, this is not a question. This is self reflection. Way to go Murat! You started with Flash Boys, jumped to Steve Jobs's 1994 interview, and finished with comparing Mozart and Beethoven mode. A great display of "discipline in thought" indeed.

Thursday, February 1, 2018

Paper review. Blockchains from a distributed computing perspective

Our distributed systems seminar met the first time this Monday. I went through the syllabus, explained how we will run the seminar, and introduced the papers we will discuss. In the second hour, to give students an overview of the blockchains technology, I went through this paper: "Blockchains from a distributed computing perspective".

I liked this paper a lot. It is from a respected distributed systems expert, Maurice Herlihy. It is written to be accessible and expository. The paper gives several examples (with increasing sophistication) to explain the blockchain concepts concretely. Finally, the paper is a perfect fit with our seminar's theme of looking at blockchains from a distributed systems perspective. Herlihy states that the paper is "colored by the perspective that much of the blockchain world is a disguised, sometimes distorted, mirror-image of the distributed computing world."

For a data processing perspective on blockchains, see this other paper.

Simple ledger example

The paper first introduces a simple ledger system using Alice's online news service as an example. To record articles as they arrive, Alice created a ledger that is implemented as a simple linked list. When an article arrives, it is placed in a shared pool, and a set of dedicated threads, called miners, collectively run a repeated protocol, called consensus, to select which item to append to the ledger. And to query for an article, a thread scans the linked-list ledger.

Is this a too simple, too constrained way of implementing the system? The paper says that the log-based system architecture has two compelling advantages:

  1. it is universal; it can implement any type of data structure, no matter how complex
  2. all questions of concurrency and fault-tolerance are compartmentalized in the consensus protocol 

Indeed this log/ledger based architecture is very popular for modern cloud-native distributed systems. This writeup, by Jay Kreps, tells you how important is the log/ledger abstraction for building distributed systems.

Kafka and BookKeeper are very popular platforms that enables compartmentalizing the concurrency and fault-tolerance in the framework's consensus protocol---realized by ZooKeeper. (As a side note, the private blockchain, Hyperledger v1.0.0-rc1, adopts a no-Byzantine consensus protocol based on Kafka.)

Finally, the Tango paper (SOSP'13) showed a good example of universality of logs: it showed how to build replicated in-memory data structures, and reliable and available distributed transactions and applications, on top of a shared log architecture. I believe the ideas explored in Tango paper can be used for efficiently maintaining multiple streams in the same log so that reading the whole chain is not needed for materializing the updates at the nodes.


Going from a centralized log implementation to a fully-decentralized public blockchain implementation needs some motivation. After her online news business, Alice wants to go to restaurant business. Like any trendy social influencer, she does an initial certificate sale (redeemable for meals) for her restaurant for raising capital.

This is somewhat like using Kickstarter for the restaurant. (Centralized is not de facto bad. Why would you not trust KickStarter? The market and the laws can straighten Kickstarter up, if it plays crooked.) However, the initial coin offering (ICO) adds more functionality on top of Kickstarter: you can sell/trade fractions of your token. In other words, the ICO creates a whole market around kickstarting.

The paper introduces the Proof-of-Work (PoW) idea using this example. Most miners are probably honest, content to collect their fees, but there is still a threat that even a small number of dishonest miners might collude with one another to cheat Alice’s investors. Alice’s first idea is to have miners, identified by their IP addresses, vote via a Byzantine fault-tolerant consensus algorithm. Alice quickly realizes this is a bad idea. Alice has a nemesis, Sybil, who is skilled in the art of manufacturing fake IP addresses. So Alice employs PoW for the blockchain. Sybil’s talent for impersonation is useless to her if each of her sock puppet miners must buy an expensive, long-shot lottery ticket.

PoW is a form of costly signaling: it is expensive in terms of time wasted and electricity bills. As a famous example, Bitcoin uses PoW for consensus: only a miner which has successfully solved a computationally hard puzzle (finding the right nonce for the block header) can append to the blockchain.

This is a good point for a concrete example. Luckily, this demonstration of PoW-based blockchain is a perfect way of doing that. If you haven't watched/tried this, take 15 minutes now to do it. Someone should give an award to Anders.

PoW has several shortcomings, of course. It is computationally very expensive and extremely wasteful for resources around the globe. Its throughput is miserable and its transactions are slow. It is nondeterministic: "It is still possible that two blocks are appended at the same time, creating a fork in the blockchain. Which block should subsequent miners build on? The usual answer is to build on the block whose chain is longest." So Bitcoin consolidates this by only considering a block as confirmed after it is followed by a number of blocks (typically six blocks).


The paper introduces smartcontracts using cross-chain swap as an example. Suppose Alice wants to trade some of her coupons to Bob in return for some bitcoins. Alice’s coupons live on one blockchain, and Bob’s bitcoin live on another, so they need to devise an atomic cross-chain swap protocol to consummate their deal. Naturally, neither one trusts the other.

I am acutely aware that technology is not the cure for everything. But, for the usecases where you can go all-digital, I think smartcontracts will be a great tool. It is an executable contract, so you can avoid notaries, lawyers, courts, and cops, because the failure clauses will be executed automatically. I think the smartcontracts idea is here to stay even when the fully-decentralized PoW-based blockchain approach dies.

On the other hand, the smartcontracts are not devoid of challenges either. The paper talks about the DAO attack, and makes a great point about the importance of concurrency control for smartcontracts: "We have seen that the notion that smart contracts do not need a concurrency model because execution is single-threaded is a dangerous illusion."  In a post last week, I had presented a TLA+/PlusCal modeling of the DAO attack.

MAD questions

Blockchain PoW Consensus vulnerabilities

The consensus problem, which is also at the heart of the blockchain, has been studied for more than 40 years in distributed systems. The PoW based blockchain takes a radical approach to implement it in an approximated manner.

Now, I am not saying distributed systems have consensus solved at scale. There is no solution that scales to the big number of participants that may be desired as in public blockchains. I wrote about it this earlier: "I love that Blockchains brings new constraints and requirements to the consensus problem. In blockchains, the participants can now be Byzantine, motivated by financial gains. And it is not sufficient to limit the consensus participants to be 3 nodes or 5 nodes, which was enough for tolerating crashes and ensuring persistency of the data. In blockchains, for reasons of attestability and tolerating colluding groups of Byzantine participants, it is preferred to keep the participants at 100s. Thus the problem becomes: How do you design a byzantine tolerant consensus algorithm that scales to 100s or 1000s of participants?"

On the other hand, the research on consensus in distributed systems literature has established a rigorous perimeter around the safety and liveness/progress guarantees that can be provided. The FLP impossibility result shows that it is impossible guarantee progress under a full asynchronous model with a crash failure---even with reliable channels. Paxos approach to solving consensus compartmentalizes the safety and progress properties nicely. Even under a fully asynchronous model (where all reasonable timing expectations break), Paxos preserves safety thanks to its balloting and anchoring system. Paxos also provides progress as soon as the partial synchrony kicks in and weak-complete & eventually-weak-accurate failure detectors are implementable (i.e., when we are out of the realm of the FLP result).

Because of this compartmentalization, we are guaranteed that Paxos's safety guarantees is not vulnerable to a timing attack where malicious parties break all reasonable timing assumptions about the system. This is a hard thing to get right, and several protocols have been shown to be vulnerable to timing attacks because they depended on some perfectly reasonable time synchronization assumptions.

It seems like the Bitcoin protocol makes several reasonable timing assumptions, but when an attacker manages to violate those timing assumptions, the safety of the protocol can be violated as well. 

How will the Bitcoin bubble end?

I had done a Twitter poll on this a while ago. What other options do you think are plausible? (Of course I am not claiming that this poll means anything.)

I still have a lot to learn about blockchains and cryptocurrencies, and in particular Bitcoin. Something I don't yet understand is whether it is possible for the ecosystem to erode in a slippery slope, tragedy of the commons fashion, to a majority Byzantine behavior. Say if a slightly different version of the protocol starts cutting some corners, and since that turns out to be advantageous financially more miners start adopting it, and soon that becomes the majority behavior and give rise to a hack (via a trojan) or slaying of the goose that lays the golden eggs.

While I don't know much about the dirty implementation details of Bitcoin, this incident doesn't give me much confidence that Bitcoin is decentralized, immutable, unforgeable, and unhackable.

What is the best medium for smartcontracts?

Is a scripting language the best medium? It is easy to screw that up as the DAO attack and ERC20 standard examples in the paper show. Maybe using a more declarative approach, and employing formal specifications and invariant-based reasoning to writing contracts could prove to be more resistant to errors.

What about applying model checking to smartcontracts? Could that help reduce the risks?

Sunday, January 28, 2018

Paxos derived

Lamport's fault-intolerant state machine replication algorithm

In 1978, Lamport published his classic "Time, Clocks, and the Ordering of Events in a Distributed System". As an application of logical clocks, he presented a distributed replicated state machine algorithm (and then he instantiated that algorithm to solve mutual exclusion as an example). Lamport complains that no one seemed to be aware of the distributed replicated state machine algorithm introduced in the paper:
"This is my most often cited paper. Many computer scientists claim to have read it. But I have rarely encountered anyone who was aware that the paper said anything about state machines. People seem to think that it is about either the causality relation on events in a distributed system, or the distributed mutual exclusion problem. People have insisted that there is nothing about state machines in the paper. I’ve even had to go back and reread it to convince myself that I really did remember what I had written."
I had talked about this distributed replicated state machine algorithm earlier. This algorithm is decentralized to a defect. It is not even tolerant to a single node failure. It assumes failure-free nodes.

The idea of the algorithm is as follows: In order to ensure that processes do not have different views of the order of updates, logical clocks is used to impose a total ordering on the updates. Each process keeps as part of its state the following: copy of the state, logical clock, queue of "modify requests" (with their logical time stamps), list of "known-times", one for every other process. Each process executes an update request on its copy of the state in increasing order of timestamps. For safety, all "known times" from other processes should be later than the time of the request.

The algorithm works as follows:
  1. Push your request in your own queue (timestamped with your logical clock)
  2. Broadcast your request to every node timestamp included.
  3. Wait for replies from all other nodes.
  4. If your request is now at the head of your queue and the known-times for other processes is ahead of its request timestamp (known-times is updated as processes send replies to the update request), enter critical section (where update to the state is done).
  5. Upon exiting the critical section, remove your request from the queue and send a release message to every process.

A fault-intolerant version of Paxos

I recently realized that the algorithm above (from the 1978 paper) constitutes a fault-intolerant instance of Paxos!

This occurred to me after thinking about it in the context of flexible quorums result. The flexible quorums idea (2016) states that we can weaken Paxos’ "all quorums should intersect" assertion to instead "only quorums from different phases should intersect". That is, majority quorums are not necessary for Paxos, provided that phase-1 quorums (Q1) intersect with phase-2 quorums (Q2).

This result allows trading off Q1 and Q2 sizes to improve performance (to the detriment of fault-tolerance)  Assuming failures and resulting leader changes are rare, phase-2 (where the leader tells the acceptors to decide values) is run more often than phase-1 (where a new leader is elected). Thus it is possible to improve performance of Paxos by reducing the size of Q2 at the expense of making the infrequently used Q1 larger. For example in a system of 10 acceptors, we can safely allow any set of only 3 acceptors to participate in Phase2, provided that we require 8 acceptors to participate for Phase1.  Note that the majority quorums (Q1=Q2=6) would be able to mask upto 5 node failures (f=5), whereas the Q1=8 configuration can only with stand upto 2 node failures (f=2) as it needs 8 nodes to be able to perform phase-1 if needed.

So, if you take Q1=N and Q2=1, the Paxos algorithm simplifies to the Lamport's distributed state machine replication algorithm above. Note that Q1=N implies the algorithm cannot tolerate any node failures, i.e., f=0. On the other hand, with this setup, you can combine phase 2 and phase 3 because you are writing to only one node, yourself. So phase 3 is non-existent in that algorithm.

The road from f=0 to Paxos

Ok, let's approach our claim from the other side as well. How do we take that f=0 protocol and strengthen it so that it doesn't block (lose progress) with one node failure?

This is how Phase 3 comes in to play as we add fault-tolerance. In order to tolerate one node crash  (in a fault-masking manner), you need Q2 to be 2. Then things suddenly get complicated, because you are not just writing to yourself, you will also need to write to another node in a guaranteed manner to persist the state. But, another leader may be stealing your turn before you can write to your other Q2 node your decision at Phase 2, so it is not safe to commit the update request! Therefore, Phase 2 clearing, which is phase 3, is needed to make this check, and it helps you replicate your state so it is preserved to the face of one node failure.

This is a point of objection, though. In Lamport's f=0 algorithm, logical clocks (LC) are used for reservation; every node respects LC, and puts requests into its queue ordered by LC. If one node needs to get its update done, it eventually will because the system is making progress. On the other hand, in Paxos, using the ballot numbers, for whose implementation LC could be used, a leader steals the previous leader's turn instead of patiently waiting the previous round to be complete. So what gives?

Well... In Lamport's f=0 algorithm, you could afford to be nice and patiently wait for each node to finish its turn, because f=0, and you are guaranteed to reach what you wait for. But when f>0 and a node can fail, you can't afford to wait for it to finish its turn (otherwise you would have to wait for an eternity in an asynchronous system model), and that is why Paxos is happy to change leaderships, and dueling leaders can arise (even to the point of violating progress).

In sum, something "fundamental" changes when you want to go fault-tolerant and tolerate node failure in an asynchronous system. When you combine faults and full-asynchrony, you get the FLP impossibility result. That means you lose progress! That is why Paxos does not guarantee making progress under a full asynchronous model with a crash failure. However, it preserves safety thanks to its balloting and anchoring system, and will provide progress as soon as the partial synchrony kicks in and weak-complete & eventually-weak-accurate failure detectors are implementable (i.e., when we are out of the realm of the FLP result). So, yes, there is a phase transition going from no faults to faults in asynchronous system.

I thank my PhD students, Ailidani Ailijiang and Aleksey Charapko, for discussion on this idea.

MAD questions

Was this actually how Leslie Lamport come up with the Paxos protocol? Does the 1978 fault-intolerant distributed state machine replication form a basis to evolve a fault-tolerant version?

I am not aware of any paper that makes this connection. Was this connection noticed and mentioned before?

Friday, January 26, 2018

Modeling the DAO attack in PlusCal

Maurice Herlihy's paper: "Blockchains from a distributed computing perspective" explains the DAO attack as follows:

"Figure 1 shows a fragment of a DAO-like contract, illustrating a function that allows an investor to withdraw funds. First, the function extracts the client's address (Line 2), then checks whether the client has enough funds to cover the withdrawal (Line 3). If so, the funds are sent to the client through an external function call (Line 4), and if the transfer is successful, the client’s balance is decremented (Line 5). 
This code is fatally  flawed. In June 2016, someone exploited this function to steal about $50 million funds from the DAO. As noted, the expression in Line 3 is a call to a function in the client's contract. Figure 2 shows the client's code. The client's contract immediately calls withdraw() again (Line 4). This re-entrant call again tests whether the client has enough funds to cover the withdrawal (Line 3), and because withdraw() decrements the balance only after the nested call is complete, the test erroneously passes, and the funds are transferred a second time, then a third, and so on, stopping only when the call stack overflows."
(Of course, that is a very simplified description of the DAO attack. More accurate descriptions are provided here and here.)

Even though the code seems sequential (after all the blockchain serializes everything), it has concurrency problems built in. This was a point made in Herlihy's paper as follows:
"In Ethereum, all contracts are recorded on the blockchain, and the ledger includes those contracts' current states. When a miner constructs a block, it fills that block with smart contracts and exe- cutes them one-by-one, where each contract's  final state is the next contract's initial state. These contract executions occur in order, so it would appear that there is no need to worry about concurrency." 
After showing DAO vulnerability and ERC20 token standard vulnerability, the paper says:
"We have seen that the notion that smart contracts do not need a concurrency model because execution is single-threaded is a dangerous illusion. Sergey and Hobor give an excellent survey of pitfalls and common bugs in smart contracts that are disguised versions of familiar concurrency pitfalls and bugs." 

Enter TLA+/PlusCal

Since TLA+/PlusCal is a great tool for catching concurrency problems, I thought it would be useful to model this DAO attack in PlusCal. After I got the idea, it took me a short time to model and model-check this in PlusCal. I used procedures in PlusCal (which I don't use often) to match the description of the problem.

TLA+ is all about invariant-based reasoning so I wrote the invariant first. Writing "SafeWithdrawal == (bankBalance=BALANCE /\ malloryBalance=0) \/ (bankBalance=BALANCE-AMOUNT /\ malloryBalance=AMOUNT)was too tight, because the updates of the balances are not happening atomically. That is how the invariant-based thinking helps us immediately: we can see that the withdrawal is a non-atomic operation, and realize that we should be more careful with the updates.

In the model checking pane, I set BALANCE as 10 and AMOUNT as 10. That is, initially Mallory has 10 coins in her bankBalance, and 0 in her wallet and wants to transfer her bankBalance and sets AMOUNT=10. When I run the model checker, it finds the double withdrawal problem immediately. Mallory's account got to 20 starting from 0! Normally we would expect it to go to 10 (line 27) temporarily, and then her bankBalance to be set to 0 (line 22). But this code managed to do double withdrawal, and the SafeWithdrawal invariant is violated.

The error trace contains 8 steps: Initially BankWithdraw is called, which then calls the MallorySendMoney to complete withdrawal. However, Mallory's SendMoney implementation includes another call to BankWithdraw and the balance check in line 18 passes because bankBalance is not decremented by amount (that comes in line 22). So the second BankWithdraw executes concurrently and Mallory manages to do double (and later triple) withdrawal.

Fixing things

Ok, let's check if we can fix this if we move the bankBalance subtraction before MallorySendMoney.
Of course for that we change SafeWithDrawal to accommodate the new way of updating bankBalance. But it turns out that is still too tight. If I call this with BALANCE=10 and AMOUNT=4, it is OK to have two withdrawals concurrently provided that in the final state no new money is produced: Invariant == bankBalance+malloryBalance <= BALANCE. I also model check for progress and write an EndState temporal formula for it: EndState == <>(bankBalance<=BALANCE-AMOUNT /\ bankBalance+malloryBalance=BALANCE). When we model check it, we see that this solves the problem.  So it leaves me puzzled, why, when it was this easy, the original BankWithdraw code was not coded this way and was left vulnerable to the attack.

These PlusCal models are available on my Github directory.

MAD questions

Should we come up with a PlusCal framework to facilitate modeling and model-checking of smart-contracts?

I had written about why you should model. Those apply here as well, and here things become even more critical. When money is involved, attackers get smart quickly, and it is easy to have vulnerabilities in concurrent code due to the many corner cases. Let TLA+/PlusCal show you those cornercases and help you design your protocol to achieve correctness guarantees. So if you are writing smartcontracts, I think it makes sense to first model-check and verify them. It doesn't take much effort, and it can save you from big problems.

Related links

Here is some previous discussion/context about why I started assigning TLA+/PlusCal modeling projects in distributed systems classes.

There is a vibrant Google Groups forum for TLA+ :!forum/tlaplus

Clicking on label "tla" at the end of the post you can reach all my posts about TLA+