<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.7.4">Jekyll</generator><link href="vhida.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="vhida.github.io/" rel="alternate" type="text/html" /><updated>2018-10-27T17:13:43+00:00</updated><id>vhida.github.io/</id><title type="html">Dream Theatre</title><subtitle>Thoughts and Practice on Informatics</subtitle><entry><title type="html">Paper Review Series : Dynamo</title><link href="vhida.github.io/Paper-Review-Series-Dynamo/" rel="alternate" type="text/html" title="Paper Review Series : Dynamo" /><published>2018-10-22T00:00:00+00:00</published><updated>2018-10-22T00:00:00+00:00</updated><id>vhida.github.io/Paper-Review-Series-:-Dynamo</id><content type="html" xml:base="vhida.github.io/Paper-Review-Series-Dynamo/">&lt;p&gt;Dynamo is a fully distributed key-value store developed by Amazon that achieves high availability and scalability by compromising on consistency.&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;High availability for writes is achieved by hinted handoff: when a node is down, writes routed to it are redirected to the next available node on the hash ring. The metadata of each such write carries a hint indicating which node the write was originally intended for. When the failed node comes back up, the writes are replicated back to it.&lt;/li&gt;
  &lt;li&gt;Eventual consistency is achieved by vector clocks and quorum-style voting. By configuring the write quorum W, the read quorum R and the replica count N, users can obtain different levels of SLA.&lt;/li&gt;
  &lt;li&gt;High incremental scalability and load-balancing efficiency are achieved by a consistent hashing algorithm that partitions data over the nodes of the system, in which each node is assigned multiple points on the hash ring.&lt;/li&gt;
&lt;/ol&gt;</content><author><name></name></author><summary type="html">Dynamo is a fully distributed key-value store developed by Amazon that achieves high availability and scalability by compromising on consistency. High availability for writes is achieved by hinted handoff: when a node is down, writes routed to it are redirected to the next available node on the hash ring. The metadata of each such write carries a hint indicating which node the write was originally intended for. When the failed node comes back up, the writes are replicated back to it. Eventual consistency is achieved by vector clocks and quorum-style voting. By configuring the write quorum W, the read quorum R and the replica count N, users can obtain different levels of SLA. High incremental scalability and load-balancing efficiency are achieved by a consistent hashing algorithm that partitions data over the nodes of the system, in which each node is assigned multiple points on the hash ring.</summary></entry><entry><title type="html">Paper Review Series : Chord</title><link href="vhida.github.io/Paper-Review-Series-Chord/" rel="alternate" type="text/html" title="Paper Review Series : Chord" /><published>2018-10-17T00:00:00+00:00</published><updated>2018-10-17T00:00:00+00:00</updated><id>vhida.github.io/Paper-Review-Series-:-Chord</id><content type="html" xml:base="vhida.github.io/Paper-Review-Series-Chord/">&lt;p&gt;Chord is a scalable protocol for a lookup service in a dynamic peer-to-peer system with frequent node arrivals and departures. Chord is meaningful because efficient data location is important in a decentralised system. The paper discusses the system model
that motivates the Chord protocol and proves several of its properties. It also addresses the handling of concurrent joins and leaves. Finally, it examines experimental results.
Strengths:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;The idea behind Chord’s lookup algorithm resembles binary search. Nodes form a topological ring. Instead of storing information about all other nodes, each node stores only nodes at exponentially doubling distances away on the ring. To look up the node responsible for a key, hop to the nearest known predecessor, halving the remaining distance to the target node at each step. It is proven that, with high probability, the search converges in a logarithmic number of iterations. This lookup algorithm makes Chord highly scalable and improves lookup performance.&lt;/li&gt;
  &lt;li&gt;The “stabilization” protocol, in which every node periodically checks that its successor information is up to date, guarantees the correctness of lookups when nodes join and leave. It makes the system fully decentralised: there is no master node or node hierarchy, and no need for global awareness.&lt;/li&gt;
  &lt;li&gt;The lookup algorithm is simple and proven correct.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Weaknesses:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Since Chord disregards the physical network topology and assumes hops between any two nodes have the same cost, a search involving hops between nodes that are physically far apart in the network can be very costly.&lt;/li&gt;
&lt;/ol&gt;</content><author><name></name></author><summary type="html">Chord is a scalable protocol for a lookup service in a dynamic peer-to-peer system with frequent node arrivals and departures. Chord is meaningful because efficient data location is important in a decentralised system. The paper discusses the system model that motivates the Chord protocol and proves several of its properties. It also addresses the handling of concurrent joins and leaves. Finally, it examines experimental results. Strengths: The idea behind Chord’s lookup algorithm resembles binary search. Nodes form a topological ring. Instead of storing information about all other nodes, each node stores only nodes at exponentially doubling distances away on the ring. To look up the node responsible for a key, hop to the nearest known predecessor, halving the remaining distance to the target node at each step. It is proven that, with high probability, the search converges in a logarithmic number of iterations. This lookup algorithm makes Chord highly scalable and improves lookup performance. The “stabilization” protocol, in which every node periodically checks that its successor information is up to date, guarantees the correctness of lookups when nodes join and leave. It makes the system fully decentralised: there is no master node or node hierarchy, and no need for global awareness. The lookup algorithm is simple and proven correct.
Weaknesses: Since Chord disregards the physical network topology and assumes hops between any two nodes have the same cost, a search involving hops between nodes that are physically far apart in the network can be very costly.</summary></entry><entry><title type="html">Paper Review Series : Apache Flink</title><link href="vhida.github.io/Paper-Review-Series-Apache-Flink/" rel="alternate" type="text/html" title="Paper Review Series : Apache Flink" /><published>2018-10-03T00:00:00+00:00</published><updated>2018-10-03T00:00:00+00:00</updated><id>vhida.github.io/Paper-Review-Series-:-Apache-Flink</id><content type="html" xml:base="vhida.github.io/Paper-Review-Series-Apache-Flink/">&lt;p&gt;Apache Flink is an open source system for stream and batch processing. Traditionally, stream and batch data processing have been deemed very different applications and approached with different models, APIs and systems. Apache Flink, however, takes the view that batch processing is a special case of stream processing, and that the streaming model can serve as the unifying framework for both problems.
This paper illustrates the unified architecture of stream and batch data processing that Apache Flink is built upon. It shows how streaming, batch, iterative, and interactive analytics can be represented as fault-tolerant streaming dataflows. It then discusses how to build a stream analytics system with a flexible windowing mechanism, and a batch processor, on top of these dataflows.
1. Computations on the input data stream flow along a DAG and exchange data through buffers. The computations are in-memory and thus fast. The buffered exchange propagates back pressure to the producer. To my understanding, this is very similar to the producer-consumer paradigm, where the producer cannot produce until the consumer has finished consuming.
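The producer-consumer analogy can be sketched in a few lines. This is a toy illustration of buffer-based back pressure, not Flink's actual exchange mechanism: a small bounded queue makes the producer block whenever the consumer falls behind.

```python
import queue
import threading

buf = queue.Queue(maxsize=4)  # small bounded buffer between two "operators"
consumed = []

def producer():
    for i in range(10):
        buf.put(i)        # blocks when the buffer is full -> back pressure
    buf.put(None)         # sentinel marking the end of the stream

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        consumed.append(item * 2)  # stand-in for a downstream computation

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)  # every item arrives despite the tiny buffer
```

The producer never outruns the consumer by more than the buffer capacity, which is exactly the back-pressure behaviour described above.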
2. Flink regularly inserts “barriers” into the data stream. These barriers move with the input data through the DAG but are not processed as records; they mark the checkpoints at which operators snapshot their state. The snapshotting can be asynchronous and incremental, and thus does not stop processing. This snapshotting mechanism is independent of the processing logic and decoupled from control messages. It is also agnostic to the external storage used.
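The barrier idea can be sketched as follows. This is a deliberately simplified, hypothetical illustration (a single operator, no alignment across inputs), not Flink's API: barriers are interleaved with records, and whenever the operator sees one it records its current state and keeps going.

```python
# Sentinel object standing in for a checkpoint barrier in the stream.
BARRIER = object()

def run_operator(stream):
    state = 0          # the operator's running state (here: a sum)
    snapshots = []     # one checkpointed state per barrier seen
    for record in stream:
        if record is BARRIER:
            snapshots.append(state)  # snapshot; the stream keeps flowing
        else:
            state += record          # normal record processing
    return state, snapshots

stream = [1, 2, BARRIER, 3, 4, BARRIER, 5]
state, snapshots = run_operator(stream)
print(state, snapshots)  # 15 [3, 10]
```

Each snapshot captures the state as of a consistent point in the stream, which is the essence of the barrier mechanism.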
3. Unlike Spark, Flink treats streaming as the unifying paradigm, with batch processing as a special case of stream processing over a bounded data set. Batch processing can be fulfilled by the Flink streaming model by placing all input data into a single window. On top of the streaming model, batch processing is further optimised: simpler syntax, blocking operators, optimised queries, and a dedicated API. Since the data is static, snapshotting for fault tolerance can be turned off as well.&lt;/p&gt;</content><author><name></name></author><summary type="html">Apache Flink is an open source system for stream and batch processing. Traditionally, stream and batch data processing have been deemed very different applications and approached with different models, APIs and systems. Apache Flink, however, takes the view that batch processing is a special case of stream processing, and that the streaming model can serve as the unifying framework for both problems. This paper illustrates the unified architecture of stream and batch data processing that Apache Flink is built upon. It shows how streaming, batch, iterative, and interactive analytics can be represented as fault-tolerant streaming dataflows. It then discusses how to build a stream analytics system with a flexible windowing mechanism, and a batch processor, on top of these dataflows. 1. Computations on the input data stream flow along a DAG and exchange data through buffers. The computations are in-memory and thus fast. The buffered exchange propagates back pressure to the producer. To my understanding, this is very similar to the producer-consumer paradigm, where the producer cannot produce until the consumer has finished consuming. 2. Flink regularly inserts “barriers” into the data stream. These barriers move with the input data through the DAG but are not processed as records; they mark the checkpoints at which operators snapshot their state. The snapshotting can be asynchronous and incremental, and thus does not stop processing.
This snapshotting mechanism is independent of the processing logic and decoupled from control messages. It is also agnostic to the external storage used. 3. Unlike Spark, Flink treats streaming as the unifying paradigm, with batch processing as a special case of stream processing over a bounded data set. Batch processing can be fulfilled by the Flink streaming model by placing all input data into a single window. On top of the streaming model, batch processing is further optimised: simpler syntax, blocking operators, optimised queries, and a dedicated API. Since the data is static, snapshotting for fault tolerance can be turned off as well.</summary></entry><entry><title type="html">Paper Review Series : Google File System</title><link href="vhida.github.io/Paper-Review-Series-Google-File-System/" rel="alternate" type="text/html" title="Paper Review Series : Google File System" /><published>2018-10-01T00:00:00+00:00</published><updated>2018-10-01T00:00:00+00:00</updated><id>vhida.github.io/Paper-Review-Series-:-Google-File-System</id><content type="html" xml:base="vhida.github.io/Paper-Review-Series-Google-File-System/">&lt;p&gt;Google File System (GFS) is a scalable distributed data-intensive file system designed and implemented by Google. While sharing features with traditional file systems, it is designed with new issues and concerns taken into consideration, such as frequent component failures, files of extra-large sizes, append-only updates, coupling of the file system and applications, throughput over latency, and efficient multi-client appending. The paper discusses the design overview, how the system works, metadata, and garbage collection. It also assesses high availability, fault tolerance and diagnosis. Finally, it presents the performance benchmarks and analysis.
The system consists of a single master, multiple chunk servers and clients. Files are divided into chunks, which are stored on chunk servers. The master stores all file system metadata, including the namespace and the current locations of chunks. It also manages system activities such as garbage collection, chunk leases and chunk migration. Clients are the interface between the system and applications. The topology of the system is simple and clear. It can be deployed on commodity machines and can achieve linear scalability by adding more nodes. Reliability and high availability are fulfilled by replication and a shadow-master mechanism. Every chunk is replicated onto multiple chunk servers. Both the master and the chunk servers are designed to restart and recover quickly. The master’s state is also replicated onto “shadow” masters; if the master fails, the shadow masters provide read-only access to the file system. For data integrity, chunk servers use checksumming. If data corruption is detected, the chunk server informs the master and requests are directed to other replicas. The master then creates a new uncorrupted replica and deletes the corrupted one.&lt;/p&gt;</content><author><name></name></author><summary type="html">Google File System (GFS) is a scalable distributed data-intensive file system designed and implemented by Google. While sharing features with traditional file systems, it is designed with new issues and concerns taken into consideration, such as frequent component failures, files of extra-large sizes, append-only updates, coupling of the file system and applications, throughput over latency, and efficient multi-client appending. The paper discusses the design overview, how the system works, metadata, and garbage collection. It also assesses high availability, fault tolerance and diagnosis. Finally, it presents the performance benchmarks and analysis. The system consists of a single master, multiple chunk servers and clients. Files are divided into chunks, which are stored on chunk servers. The master stores all file system metadata, including the namespace and the current locations of chunks. It also manages system activities such as garbage collection, chunk leases and chunk migration. Clients are the interface between the system and applications. The topology of the system is simple and clear. It can be deployed on commodity machines and can achieve linear scalability by adding more nodes. Reliability and high availability are fulfilled by replication and a shadow-master mechanism. Every chunk is replicated onto multiple chunk servers. Both the master and the chunk servers are designed to restart and recover quickly. The master’s state is also replicated onto “shadow” masters; if the master fails, the shadow masters provide read-only access to the file system. For data integrity, chunk servers use checksumming. If data corruption is detected, the chunk server informs the master and requests are directed to other replicas.
The master then creates a new uncorrupted replica and deletes the corrupted one.</summary></entry><entry><title type="html">Paper Review Series : Resilient Distributed Dataset</title><link href="vhida.github.io/Paper-Review-Series-Resilient-Distributed-Dataset/" rel="alternate" type="text/html" title="Paper Review Series : Resilient Distributed Dataset" /><published>2018-09-26T00:00:00+00:00</published><updated>2018-09-26T00:00:00+00:00</updated><id>vhida.github.io/Paper-Review-Series-:-Resilient-Distributed-Dataset</id><content type="html" xml:base="vhida.github.io/Paper-Review-Series-Resilient-Distributed-Dataset/">&lt;p&gt;Resilient Distributed Dataset (RDD) is a memory abstraction that enables efficient data reuse. Programmers can explicitly partition and persist intermediate data and manipulate it with RDD operators, thus avoiding I/O cost. RDDs provide fault tolerance by being immutable: updating an RDD in fact transforms it into another RDD. This history of transformations, called lineage, is logged and can be used to re-construct the original RDD, providing a fault-tolerance mechanism.
Strengths&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;The RDD abstraction increases data reusability, saving I/O cost, which can be significant for interactive tasks and iterative computations.&lt;/li&gt;
  &lt;li&gt;The immutability of RDDs provides a trivial fault-tolerance mechanism: updates in fact create new RDDs.&lt;/li&gt;
  &lt;li&gt;RDDs work like a production line (pipeline), which enhances performance significantly.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Weaknesses:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;RDDs seem to require a considerably large amount of memory. Moreover, the implementation is based on the JVM, so it cannot manage memory directly and efficiently.&lt;/li&gt;
  &lt;li&gt;Spark seems to require a large amount of memory to be efficient.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While MapReduce persists intermediate results to external storage, which is painfully I/O-expensive, RDDs are kept in memory. For iterative and interactive tasks, where intermediate data is frequently produced, this can boost performance dramatically. The main challenge is the fault-tolerance mechanism, which RDD addresses essentially by a workaround: RDDs are designed to be read-only, so consistency is trivial, and updates to data are achieved by making new RDDs. These transformations are logged and can be used to re-construct a previous RDD. It is much the same idea as recording the edits instead of the actual data in version control. Finally, a MapReduce server works individually, basically like a workshop: it does its own resource management, job scheduling and so on, while RDDs work like a production line (pipeline). This makes RDDs more efficient.&lt;/p&gt;
&lt;p&gt;RDDs have to be persisted to disk if memory is insufficient. Scala is JVM-based, so I guess it cannot manage machine memory directly; the JVM overhead could therefore be problematic.&lt;/p&gt;</content><author><name></name></author><summary type="html">Resilient Distributed Dataset (RDD) is a memory abstraction that enables efficient data reuse. Programmers can explicitly partition and persist intermediate data and manipulate it with RDD operators, thus avoiding I/O cost. RDDs provide fault tolerance by being immutable: updating an RDD in fact transforms it into another RDD. This history of transformations, called lineage, is logged and can be used to re-construct the original RDD, providing a fault-tolerance mechanism. Strengths: The RDD abstraction increases data reusability, saving I/O cost, which can be significant for interactive tasks and iterative computations. The immutability of RDDs provides a trivial fault-tolerance mechanism: updates in fact create new RDDs. RDDs work like a production line (pipeline), which enhances performance significantly. Weaknesses: 1. RDDs seem to require a considerably large amount of memory. Moreover, the implementation is based on the JVM, so it cannot manage memory directly and efficiently. 2. Spark seems to require a large amount of memory to be efficient. While MapReduce persists intermediate results to external storage, which is painfully I/O-expensive, RDDs are kept in memory. For iterative and interactive tasks, where intermediate data is frequently produced, this can boost performance dramatically. The main challenge is the fault-tolerance mechanism, which RDD addresses essentially by a workaround: RDDs are designed to be read-only, so consistency is trivial, and updates to data are achieved by making new RDDs. These transformations are logged and can be used to re-construct a previous RDD. It is much the same idea as recording the edits instead of the actual data in version control. Finally, a MapReduce server works individually, basically like a workshop: it does its own resource management, job scheduling and so on, while RDDs work like a production line (pipeline). This makes RDDs more efficient.
RDDs have to be persisted to disk if memory is insufficient. Scala is JVM-based, so I guess it cannot manage machine memory directly; the JVM overhead could therefore be problematic.</summary></entry><entry><title type="html">Paper Review Series : Mapreduce</title><link href="vhida.github.io/Paper-Review-Series-MapReduce/" rel="alternate" type="text/html" title="Paper Review Series : Mapreduce" /><published>2018-09-19T00:00:00+00:00</published><updated>2018-09-19T00:00:00+00:00</updated><id>vhida.github.io/Paper-Review-Series-:-MapReduce</id><content type="html" xml:base="vhida.github.io/Paper-Review-Series-MapReduce/">&lt;p&gt;MapReduce is a programming paradigm in which many real-world problems can be expressed or framed as two functions, Map and Reduce. The advantage is that problems framed in this paradigm are automatically parallelised. The implementation deals with distributing computations onto commodity servers so that programmers can code their logic without knowledge of distributed systems. This paper discusses the implementation of MapReduce and how it can be used.
The MapReduce interface is very simple to use. Users basically only have to write their business logic in the Map and Reduce functions, which are fairly straightforward as well. On the other hand, this means a relatively involved job might require several map and reduce functions, which is not only cumbersome to write but also implies performance issues, as one slow map function may stall the execution of subsequent, dependent functions.
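The division of labour can be illustrated with the canonical word-count example, here as a single-process sketch rather than Google's implementation: the user supplies only the map and reduce logic, while the framework (simulated by `run_mapreduce` below, a name made up for this sketch) handles grouping values by key.

```python
from collections import defaultdict

def map_fn(document):
    # User logic: emit (word, 1) for every word in the document.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # User logic: sum the counts emitted for one word.
    return word, sum(counts)

def run_mapreduce(documents):
    # The "framework": run map, shuffle values by key, then run reduce.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_mapreduce(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

Everything outside `map_fn` and `reduce_fn` is what the real system distributes across machines transparently.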
The system runs on commodity hardware, and fault tolerance and failure recovery are simple: in case of worker failure, jobs are re-executed, except for completed reduce jobs. If the master fails, clients check and decide whether to retry. Combined, these make the system very scalable.&lt;br /&gt;
One thing that is insufficiently discussed is job scheduling and resource allocation policy, both of which are very important in distributed systems.&lt;/p&gt;</content><author><name></name></author><summary type="html">MapReduce is a programming paradigm in which many real-world problems can be expressed or framed as two functions, Map and Reduce. The advantage is that problems framed in this paradigm are automatically parallelised. The implementation deals with distributing computations onto commodity servers so that programmers can code their logic without knowledge of distributed systems. This paper discusses the implementation of MapReduce and how it can be used. The MapReduce interface is very simple to use. Users basically only have to write their business logic in the Map and Reduce functions, which are fairly straightforward as well. On the other hand, this means a relatively involved job might require several map and reduce functions, which is not only cumbersome to write but also implies performance issues, as one slow map function may stall the execution of subsequent, dependent functions. The system runs on commodity hardware, and fault tolerance and failure recovery are simple: in case of worker failure, jobs are re-executed, except for completed reduce jobs. If the master fails, clients check and decide whether to retry. Combined, these make the system very scalable.
One thing that is insufficiently discussed is job scheduling and resource allocation policy, both of which are very important in distributed systems.</summary></entry><entry><title type="html">Paper Review Series : C Store Dbms</title><link href="vhida.github.io/Paper-Review-Series-C-Store-DBMS/" rel="alternate" type="text/html" title="Paper Review Series : C Store Dbms" /><published>2018-09-12T00:00:00+00:00</published><updated>2018-09-12T00:00:00+00:00</updated><id>vhida.github.io/Paper-Review-Series-:-C-Store-DBMS</id><content type="html" xml:base="vhida.github.io/Paper-Review-Series-C-Store-DBMS/">&lt;p&gt;Traditional row-based relational databases are write-optimised, which is not very effective for ad hoc queries over large amounts of data. To address this issue, the paper presents the design of a column-based relational database.
The distinctiveness of the design is its columnar organisation of data, which lends itself to various data compression techniques and hence to improved read performance.
The paper also presents a non-traditional implementation of read-only transactions. It separates reads from writes and avoids locks on reads, mainly applying snapshot isolation and periodic synchronisation.&lt;/p&gt;</content><author><name></name></author><summary type="html">Traditional row-based relational databases are write-optimised, which is not very effective for ad hoc queries over large amounts of data. To address this issue, the paper presents the design of a column-based relational database. The distinctiveness of the design is its columnar organisation of data, which lends itself to various data compression techniques and hence to improved read performance. The paper also presents a non-traditional implementation of read-only transactions. It separates reads from writes and avoids locks on reads, mainly applying snapshot isolation and periodic synchronisation.</summary></entry><entry><title type="html">Paper Review Series : Google</title><link href="vhida.github.io/Paper-Review-Series-Google/" rel="alternate" type="text/html" title="Paper Review Series : Google" /><published>2018-09-07T00:00:00+00:00</published><updated>2018-09-07T00:00:00+00:00</updated><id>vhida.github.io/Paper-Review-Series-:-Google</id><content type="html" xml:base="vhida.github.io/Paper-Review-Series-Google/">&lt;p&gt;This paper proposes Google, a large-scale hypertextual web search engine. It describes in detail efficient crawling and indexing of the Web, and mechanisms for much higher-precision search results. Google uses a combination of data structures and algorithms for crawling and indexing performance in terms of time and space. It introduces “PageRank”, a metric computed from backlinks, to prioritise search results. The paper shows performance results and discusses future work such as high-quality search and scalability.
Strengths&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;The paper identifies the quality issue of search results and addresses it with “PageRank”, a metric defined on web pages to order them by relevance/importance, and with anchor propagation.&lt;/li&gt;
  &lt;li&gt;Optimised data structures like BigFile keep the cost of crawling, indexing, and searching low.&lt;/li&gt;
  &lt;li&gt;The distributed system architecture, such as the crawling servers, is conducive to future scale-up.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Weaknesses (i.e. Limitations or Possible Improvements):&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Common techniques like query caching and sub-indexing are not mentioned.&lt;/li&gt;
  &lt;li&gt;The paper does not mention that parsing, indexing, and searching are, or could be, partitioned and distributed, although Google at the time was a prototype.&lt;/li&gt;
  &lt;li&gt;Manipulation of the system for higher search-result ranking seems to be an important issue and is not sufficiently elaborated.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While previous search engines could achieve fast and complete indexing, people are only able, or interested, in viewing the most relevant results. Quality, or precision, is therefore where newer effort on search engines should be directed. This paper proposes the system “Google” to address the issue. It introduces a new metric, “PageRank”, by which search results are prioritised. The “PageRank” of a web page is iteratively computed from the weighted “PageRank” of its backlinks with a damping factor. The system also applies the anchor-propagation technique used in the World Wide Web Worm: the text of a link is associated not only with the page the link is on, but also with the page the link points to. This has two benefits. First, link text often provides a more accurate description of a web page than the page itself. Second, web pages that are not hypertextual, and thus not crawlable, such as videos, images, databases and so on, can still be returned as results.&lt;/p&gt;
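The iterative computation described above can be sketched as a short power iteration. The three-page link graph here is made up purely for illustration; the update rule follows the general backlink-weighted form with a damping factor, not necessarily the exact normalisation used in the paper.

```python
def pagerank(links, d=0.85, iters=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum the rank of every page q that links (a "backlink") to p,
            # divided by q's number of outlinks.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / len(pages) + d * incoming
        rank = new
    return rank

# Toy graph: "a" is linked to by both "b" and "c", so it should rank highest.
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # a
```

The iteration converges because each step contracts differences by roughly the damping factor, which is why a few dozen iterations suffice.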

&lt;p&gt;A search engine working over a large domain of web pages could require impractically large storage, and consequently be slow, if not handled with proper data structures. Google is designed to avoid disk I/O as much as possible. Although not detailed in the paper, the BigFile structure, the hand-optimised hit lists, and the lexicons of several forms presumably play important roles in reducing space consumption while keeping record fetching efficient.&lt;/p&gt;

&lt;p&gt;Google anticipates the explosion of web content and adopts a distributed architecture for the crawling servers, a major part of the search engine. It attains impressive performance and can scale up easily with more hardware; in a sense, it pioneered and directed the advancement of internet technology. However, other processes like indexing could also be made distributed, and the paper does not mention that. Further, a distributed system could be partitioned for many benefits, which is also not brought up in the paper.&lt;/p&gt;

&lt;p&gt;Common techniques like query caching and sub-indexing/secondary indexing are also not touched upon. The reasons could well be legitimate, for instance that web pages change constantly, so caching makes little sense, or that sub-indexing is not cost-effective in terms of time cost and search speed, but a brief discussion would always be better.&lt;/p&gt;

&lt;p&gt;A last, rather important issue: how the system can be made immune to manipulation for higher ranking is not clear. Based on the information given in the paper, it seems feasible to reverse-engineer web pages so that they rank higher in potential search results. The paper mentions that the damping factor in computing “PageRank” enables personalisation, which could prevent such a scenario; however, no mechanism or concrete idea is discussed.&lt;/p&gt;</content><author><name></name></author><summary type="html">This paper proposes Google, a large-scale hypertextual web search engine. It describes in detail efficient crawling and indexing of the Web, and mechanisms for much higher-precision search results. Google uses a combination of data structures and algorithms for crawling and indexing performance in terms of time and space. It introduces “PageRank”, a metric computed from backlinks, to prioritise search results. The paper shows performance results and discusses future work such as high-quality search and scalability. Strengths: The paper identifies the quality issue of search results and addresses it with “PageRank”, a metric defined on web pages to order them by relevance/importance, and with anchor propagation. Optimised data structures like BigFile keep the cost of crawling, indexing, and searching low. The distributed system architecture, such as the crawling servers, is conducive to future scale-up. Weaknesses (i.e. Limitations or Possible Improvements): Common techniques like query caching and sub-indexing are not mentioned. The paper does not mention that parsing, indexing, and searching are, or could be, partitioned and distributed, although Google at the time was a prototype.
Manipulation of the system for higher search-result ranking seems to be an important issue and is not sufficiently elaborated.</summary></entry><entry><title type="html">Roadmap Of &amp;#8220;knowing&amp;#8221; Neural Network</title><link href="vhida.github.io/Roadmap-of-Knowing-Neural-Network/" rel="alternate" type="text/html" title="Roadmap Of &quot;knowing&quot; Neural Network" /><published>2017-12-27T00:00:00+00:00</published><updated>2017-12-27T00:00:00+00:00</updated><id>vhida.github.io/Roadmap-of-%22Knowing%22-Neural-Network</id><content type="html" xml:base="vhida.github.io/Roadmap-of-Knowing-Neural-Network/">&lt;p&gt;I say &lt;code class=&quot;highlighter-rouge&quot;&gt;&quot;knowing&quot;&lt;/code&gt; because I’m no longer sure of what the word &lt;code class=&quot;highlighter-rouge&quot;&gt;&quot;learn&quot;&lt;/code&gt; means exactly, epistemologically. I’ve been studying data science for two semesters. It seems the big data hype is over and everybody is talking about deep learning, AI, etc. now, but as a matter of fact, neural networks ARE very powerful at solving certain problems, though in a mysterious way. They will change the landscape of many sectors of the economy for sure. As a CS student or professional, having some deep learning literacy may be rather desirable. Most people are unable to invent new algorithms or models, even with a PhD, but only knowing how to use libraries and tune parameters is not enough. We may need a broad, not too detailed, but working understanding. Here’s a roadmap to acquire that kind of understanding.
First and foremost, some grasp of theory is a must. The book by &lt;code class=&quot;highlighter-rouge&quot;&gt;Ian Goodfellow&lt;/code&gt; et al. should suffice. If you’ve been off campus for long, a recap of some rudimentary statistics may also be necessary. &lt;br /&gt;
Then get your hands dirty. Implement some naive NNs: an MLP, a vanilla CNN or RNN, without using libs. Some key aspects of designing a model, such as dataflow, the forward and backward passes, and so on, may be worth additional attention. &lt;br /&gt;
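&lt;/p&gt;

&lt;p&gt;As a sketch of that from-scratch step, here is a minimal MLP with explicit forward and backward passes and no deep learning libs. The architecture, the XOR toy data, and the hyperparameters are illustrative choices of mine, not anything prescribed by this roadmap:&lt;/p&gt;

```python
import numpy as np

# Toy problem: XOR, the classic task a single linear layer cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # hidden layer, 8 tanh units
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # sigmoid output unit

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.3
for step in range(10000):
    # Forward pass: input, then tanh hidden layer, then sigmoid output.
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule on a mean-squared-error loss.
    d_out = (out - y) * out * (1.0 - out)        # gradient at the output pre-activation
    d_W2 = h.T @ d_out
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)        # back through tanh
    d_W1 = X.T @ d_h
    # Plain gradient-descent update.
    W2 -= lr * d_W2; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * d_W1; b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))  # predictions should approach 0, 1, 1, 0
```

&lt;p&gt;Writing the backward pass by hand like this is exactly what makes the autograd machinery of the libs feel less mysterious later.&lt;/p&gt;

&lt;p&gt;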
Finally, you need to work with libs. After all, that’s what the vast majority of AI/deep learning/machine learning engineers do. There are many; PyTorch and then TensorFlow, in that order, is quite orthodox, and the learning curve beginning with PyTorch is gentle. There are many tutorials online you can follow.&lt;/p&gt;</content><author><name></name></author><summary type="html">I say &quot;knowing&quot; because I’m no longer sure of what the word &quot;learn&quot; means exactly, epistemologically. I’ve been studying data science for 2 semesters. It seems the big data hype is over and everybody is talking about deep learning, AI, etc. now, but as a matter of fact, neural networks ARE very powerful in solving certain problems, though in a mysterious way. They will change the landscape of many sectors of the economy for sure. As a CS student or professional, having some deep learning literacy may be rather desirable. Most people are unable to invent new algorithms or models, even with a PhD, but only knowing how to use libs and tune params is not enough. We may need a broad, not too detailed, but working understanding. Here’s a roadmap to acquire that kind of understanding. First and foremost, some grasp of theory is a must. The book by Ian Goodfellow et al. should suffice. If you’ve been off campus for long, a recap of some rudimentary statistics may also be necessary. Then get your hands dirty. Implement some naive NNs: an MLP, a vanilla CNN or RNN, without using libs. Some key aspects of designing a model, such as dataflow, the forward and backward passes, and so on, may be worth additional attention. Finally, you need to work with libs. After all, that’s what the vast majority of AI/deep learning/machine learning engineers do. There are many; PyTorch and then TensorFlow, in that order, is quite orthodox, and the learning curve beginning with PyTorch is gentle. 
There are many tutorials online you can follow.</summary></entry><entry><title type="html">Set-up Ethereum Development environment</title><link href="vhida.github.io/Ethereum-Development-Environment-Setup/" rel="alternate" type="text/html" title="Set-up Ethereum Development environment" /><published>2017-01-20T00:00:00+00:00</published><updated>2017-01-20T00:00:00+00:00</updated><id>vhida.github.io/Ethereum-Development-Environment-Setup</id><content type="html" xml:base="vhida.github.io/Ethereum-Development-Environment-Setup/">&lt;p&gt;Here’s a brief environment setup how-to for Ethereum development. It is based on macOS.&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Install Python 2.7.&lt;/li&gt;
  &lt;li&gt;Install solc, the Solidity compiler, and solc-cli:   &lt;br /&gt;
  &lt;code class=&quot;highlighter-rouge&quot;&gt;sudo npm install -g solc solc-cli --save-dev&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Install an Ethereum client via brew, from the ethereum/ethereum tap:&lt;br /&gt;
  &lt;code class=&quot;highlighter-rouge&quot;&gt;brew tap ethereum/ethereum&lt;/code&gt;&lt;br /&gt;
  &lt;code class=&quot;highlighter-rouge&quot;&gt;brew install ethereum&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Install testrpc (to deploy smart contracts in a local test environment) via pip.&lt;/li&gt;
  &lt;li&gt;Install Node.js.&lt;/li&gt;
  &lt;li&gt;Install truffle (for fast local compilation and deployment of smart contracts).  &lt;br /&gt;
&lt;code class=&quot;highlighter-rouge&quot;&gt;npm install -g truffle&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;</content><author><name></name></author><summary type="html">Here’s a brief environment setup how-to for Ethereum development. It is based on macOS. Install Python 2.7. Install solc, the Solidity compiler, and solc-cli: sudo npm install -g solc solc-cli --save-dev Install an Ethereum client via brew, from the ethereum/ethereum tap: brew tap ethereum/ethereum brew install ethereum Install testrpc (to deploy smart contracts in a local test environment) via pip. Install Node.js. Install truffle (for fast local compilation and deployment of smart contracts). npm install -g truffle</summary></entry></feed>