Paper Review Series : Dynamo

Dynamo is a fully distributed key-value store developed by Amazon that is highly available and scalable, at the cost of strong consistency.

  1. High availability for writes is achieved by hinted handoff: when a node is down, writes routed to it are redirected to the next available node on the hash ring. The write metadata carries a hint indicating which node the write was originally meant for, and when the failed node comes back up, the writes are replicated back to it.
  2. Eventual consistency is achieved by vector clocks and quorum-like voting. By configuring W, R and N, users can choose different levels of consistency and availability.
  3. Incremental scalability and efficient load balancing are achieved by consistent hashing, which partitions data over the nodes of the system; each node is assigned to multiple points (virtual nodes) on the hash ring, as in the sketch after this list.
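
As a toy illustration of point 3 (this is not Dynamo's code; the node names, the number of virtual nodes, and the replication factor are invented), each physical node is hashed to several positions on the ring, and a key's preference list is the first N distinct nodes clockwise from the key's position:

    import bisect
    import hashlib

    def h(key):
        # Hash a string to a position on the ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes, vnodes_per_node=8, replicas=3):
            self.replicas = replicas
            # Each physical node owns several "virtual" points on the ring.
            self.ring = sorted((h(f"{n}#{i}"), n)
                               for n in nodes for i in range(vnodes_per_node))
            self.points = [p for p, _ in self.ring]

        def preference_list(self, key):
            # The first `replicas` distinct nodes clockwise from the key.
            idx = bisect.bisect(self.points, h(key)) % len(self.ring)
            result = []
            while len(result) < self.replicas:
                node = self.ring[idx % len(self.ring)][1]
                if node not in result:
                    result.append(node)
                idx += 1
            return result

    ring = HashRing(["A", "B", "C", "D", "E"])
    print(ring.preference_list("user:42"))  # e.g. ['C', 'A', 'E'] -- the N=3 replica nodes
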
Read More

Paper Review Series : Chord

Chord is a scalable lookup protocol for dynamic peer-to-peer systems with frequent node arrivals and departures. Chord matters because efficiently locating data is important in a decentralised system. The paper discusses the system model that motivates the Chord protocol and proves several of its properties. It also addresses the handling of concurrent joins and leaves. Finally, it examines experimental results. Strengths:

  1. The idea behind Chord's lookup algorithm is essentially binary search. Nodes form a topological ring. Instead of storing information about all other nodes, each node keeps fingers to nodes at power-of-two distances away on the ring. To look up the node responsible for a key, hop to the closest preceding finger, roughly halving the remaining distance to the target each time. It is proven that, with high probability, the search converges in a logarithmic number of hops, which makes Chord highly scalable and lookups fast; see the sketch after this list.
  2. The “stabilization” protocol, in which every node periodically checks that its successor pointers are up to date, guarantees the correctness of lookups as nodes join and leave. It makes the system fully decentralised: no master node, no node hierarchy, and no need for global awareness.
  3. The lookup algorithm is simple and proven correct. Weaknesses:
  4. Since Chord disregards the physical network topology and assumes hops between any two nodes cost the same, a search that hops between nodes that are physically far apart can be very costly.
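
To make point 1 concrete, here is a much-simplified sketch of Chord-style lookup on a 6-bit identifier circle (the node identifiers are invented, and the finger tables are computed globally here rather than maintained by the join/stabilization protocol):

    M = 6                # identifier space of 2^6 = 64 ids
    RING = 2 ** M
    NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

    def in_interval(x, a, b):
        # True if x lies in the half-open circular interval (a, b].
        return a < x <= b if a < b else (x > a or x <= b)

    def successor(n):
        return NODES[(NODES.index(n) + 1) % len(NODES)]

    def fingers(n):
        # Finger i of node n is the first node at or after n + 2^i (mod 2^M).
        return [min((x for x in NODES if x >= (n + 2 ** i) % RING),
                    default=NODES[0]) for i in range(M)]

    def closest_preceding_finger(n, key):
        for f in reversed(fingers(n)):
            if in_interval(f, n, key) and f != key:
                return f
        return n

    def find_successor(n, key, hops=0):
        if in_interval(key, n, successor(n)):
            return successor(n), hops
        nxt = closest_preceding_finger(n, key)
        if nxt == n:                  # no closer finger: our successor owns the key
            return successor(n), hops
        return find_successor(nxt, key, hops + 1)

    print(find_successor(8, 54))  # (56, 2): node 56 owns key 54, reached via 2 finger hops
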
Read More

Paper Review Series : Apache Flink

Apache Flink is an open source system for stream and batch processing. Traditionally, stream processing and batch processing have been considered very different applications and have been approached with different models, APIs and systems. Apache Flink, however, takes the view that batch processing is a special case of stream processing, and that the streaming model can be the unifying framework for both. This paper illustrates the unified architecture for stream and batch data processing that Apache Flink is built upon. It shows how streaming, batch, iterative, and interactive analytics can all be represented as fault-tolerant streaming dataflows, and then discusses how a stream analytics system with a flexible windowing mechanism, as well as a batch processor, can be built on top of these dataflows.

  1. Computation over the input stream flows along a DAG and is based on buffer exchange between operators. The computations are in-memory and therefore fast, and the use of bounded buffers propagates back pressure to producers. To my understanding this is very similar to the producer-consumer paradigm, where the producer cannot produce if the consumer has not finished consuming.
  2. Flink regularly inserts "barriers" into the data stream. The barriers flow with the input data through the DAG but are not processed as data; they mark the checkpoints at which operators snapshot their state. Snapshotting can be asynchronous and incremental and thus does not stop processing. This snapshotting mechanism is independent of the processing logic, decoupled from control messages, and unrelated to how external storage is used; a toy illustration follows this list.
  3. Unlike Spark, Flink treats streaming as the unifying paradigm and batch processing as a special case of stream processing over a bounded data set. Batch processing can be expressed in Flink's streaming model by putting all input data into a single window. On top of the streaming model, batch processing is further optimised with simpler syntax, blocking operators, query optimisation and a dedicated API. Since the data is static, snapshotting for fault tolerance can be turned off as well.
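
A toy sketch of the barrier mechanism in point 2 (this is not Flink's actual API; the operator, the stream and the checkpoint ids are invented): an operator counts records per key and snapshots its state whenever a barrier flows past, without treating the barrier as data:

    class Barrier:
        def __init__(self, checkpoint_id):
            self.checkpoint_id = checkpoint_id

    class CountingOperator:
        def __init__(self):
            self.state = {}        # key -> count (operator state)
            self.snapshots = {}    # checkpoint_id -> copy of state at the barrier

        def process(self, element):
            if isinstance(element, Barrier):
                # Barriers are not data; they only mark a consistent point
                # at which the operator snapshots its current state.
                self.snapshots[element.checkpoint_id] = dict(self.state)
                return
            self.state[element] = self.state.get(element, 0) + 1

    stream = ["a", "b", "a", Barrier(1), "b", "c", Barrier(2)]
    op = CountingOperator()
    for e in stream:
        op.process(e)

    print(op.snapshots[1])  # {'a': 2, 'b': 1} -- state as of checkpoint 1
    print(op.state)         # {'a': 2, 'b': 2, 'c': 1} -- current state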

Read More

Paper Review Series : Google File System

Google File System (GFS) is a scalable, distributed, data-intensive file system designed and implemented by Google. While sharing features with traditional file systems, it is designed with new concerns taken into consideration, such as frequent component failures, very large files, append-mostly updates, co-design of the file system and its applications, throughput over latency, and efficient multi-client appends. The paper discusses the design overview, how the system works, metadata management, and garbage collection. It also assesses high availability, fault tolerance and diagnosis, and finally shows performance benchmarks and analysis.

The system consists of a single master, multiple chunkservers and clients. Files are divided into chunks, which are stored on chunkservers. The master stores all file system metadata, including the namespace and the current locations of chunks, and manages system activities such as garbage collection, chunk leases and chunk migration. Clients are the interface between the system and applications. The topology of the system is simple and clear: it can be deployed on commodity machines and achieves near-linear scalability by adding more nodes.

Reliability and high availability are provided by replication and the shadow-master mechanism. Every chunk is replicated onto multiple chunkservers, and both the master and the chunkservers are designed to restart and recover quickly. The master's state is also replicated onto "shadow" masters; if the master fails, the shadow masters provide read-only access to the file system. For data integrity, chunkservers use checksumming: if corruption is detected, the chunkserver informs the master, requests are directed to other replicas, and the master creates a new uncorrupted replica and deletes the corrupted one.
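
As a toy sketch of the read path just described (not the real GFS API; the file names, chunk handles and chunk size are made up), the client asks the single master for chunk locations and then fetches the data directly from a chunkserver holding a replica:

    CHUNK_SIZE = 64  # bytes in this toy example; 64 MB in GFS

    class Master:
        # Holds only metadata: the namespace and chunk locations.
        def __init__(self):
            self.file_chunks = {"/logs/a": ["chunk-0", "chunk-1"]}
            self.chunk_locations = {"chunk-0": ["cs1", "cs2", "cs3"],
                                    "chunk-1": ["cs2", "cs3", "cs4"]}

        def lookup(self, filename, offset):
            handle = self.file_chunks[filename][offset // CHUNK_SIZE]
            return handle, self.chunk_locations[handle]

    class ChunkServer:
        # Holds the actual chunk data.
        def __init__(self, chunks):
            self.chunks = chunks

        def read(self, handle, start, length):
            return self.chunks[handle][start:start + length]

    master = Master()
    chunkservers = {"cs1": ChunkServer({"chunk-0": b"hello world" + b"." * 53})}

    def client_read(filename, offset, length):
        handle, locations = master.lookup(filename, offset)  # metadata from the master
        return chunkservers[locations[0]].read(handle, offset % CHUNK_SIZE, length)

    print(client_read("/logs/a", 0, 11))  # b'hello world'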

Read More

Paper Review Series : Resilient Distributed Dataset

Resilient Distributed Dataset (RDD) is a memory abstraction that makes data reuse efficient. Programmers can explicitly partition and persist intermediate data and manipulate it with RDD operators, thus avoiding I/O cost. RDDs provide fault tolerance by being immutable: updating an RDD actually means transforming it into another RDD. The chain of transformations, called the lineage, is logged and can be used to reconstruct a lost RDD, which is the fault-tolerance mechanism. Strengths:

  1. The RDD abstraction increases data reusability, saving I/O cost, which can be significant for interactive tasks and iterative computations.
  2. RDDs being immutable provides a trivial fault-tolerance mechanism; updates in fact create new RDDs.
  3. RDD works like a production line (pipeline), which enhances performance significantly.

Weaknesses:

  1. RDD seems to require a considerably large amount of memory, and the implementation is JVM-based, so it cannot manage memory directly and efficiently.
  2. Spark seems to need a large amount of memory to be efficient.

While MapReduce persists intermediate results to external storage, which is painfully I/O-expensive, RDDs are kept in memory. For iterative and interactive tasks, where intermediate data is frequently produced, this can boost performance dramatically. The main challenge is the fault-tolerance mechanism, which RDD addresses essentially with a workaround: RDDs are designed to be read-only, so consistency is trivial, and updates are made by creating new RDDs. These changes are logged and can be used to reconstruct a previous RDD, much the same idea as recording edits instead of the actual data in version control. Finally, a MapReduce server basically works individually, like a workshop doing its own resource management, job scheduling and so on, while RDDs work like a production line (pipeline), which makes them more efficient. RDDs have to be spilled to disk when memory is not enough, and since Scala runs on the JVM it cannot manage machine memory directly, I guess, so JVM overhead could potentially be problematic.
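
A toy illustration of the lineage idea (not Spark's API; everything below is invented for illustration): each dataset is immutable and remembers only its parent and the transformation that derived it, so a lost result can be recomputed instead of replicated:

    class ToyRDD:
        def __init__(self, data=None, parent=None, transform=None):
            self._data = data           # materialised data; may be absent
            self.parent = parent        # lineage: where this RDD came from
            self.transform = transform  # lineage: how it was derived

        def map(self, f):
            return ToyRDD(parent=self, transform=lambda rows: [f(x) for x in rows])

        def filter(self, pred):
            return ToyRDD(parent=self, transform=lambda rows: [x for x in rows if pred(x)])

        def collect(self):
            if self._data is not None:
                return self._data
            # Not materialised (or lost): recompute from the parent via the lineage.
            return self.transform(self.parent.collect())

    base = ToyRDD(data=[1, 2, 3, 4, 5])
    derived = base.map(lambda x: x * 10).filter(lambda x: x > 20)
    print(derived.collect())  # [30, 40, 50], recomputed purely from lineage
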
Read More

Paper Review Series : MapReduce

MapReduce is a programming paradigm in which many real-world problems can be expressed or framed in terms of two functions, Map and Reduce. The advantage is that problems framed in this paradigm are automatically parallelized: the implementation handles distributing the computation onto commodity servers, so programmers can write the logic without knowing about distributed systems. This paper discusses the implementation of MapReduce and how it can be used. The MapReduce interface is very simple to use: users basically only have to write their business logic in the Map and Reduce functions, which are fairly straightforward as well. On the other hand, this means a relatively involved job might require several map and reduce functions, which is not only cumbersome to write but also implies performance issues, as one slow map function may stall the execution of subsequent, dependent functions. The system runs on commodity hardware, and fault tolerance and failure recovery are simple: in case of worker failure, tasks are re-executed, except for completed reduce tasks; if the master fails, clients check and decide whether to retry. Combined, these make the system very scalable.
One thing that is insufficiently discussed is job scheduling and resource allocation policy, both of which are very important in distributed systems.
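
For concreteness, here is the classic word-count example expressed in this paradigm, with plain Python standing in for a real MapReduce runtime (the shuffle step is simulated by grouping in memory):

    from collections import defaultdict

    def map_fn(document):
        # Map: emit (word, 1) for every word in the document.
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Reduce: sum all counts emitted for the same word.
        return word, sum(counts)

    def run(documents):
        groups = defaultdict(list)           # the "shuffle": group values by key
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        return dict(reduce_fn(k, vs) for k, vs in groups.items())

    print(run(["the cat sat", "the cat ran"]))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}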

Read More

Paper Review Series : C-Store DBMS

Traditional row-based relational databases are write-optimised, which is not very effective for ad hoc queries over large amounts of data. To address the issue, this paper presents the design of a column-based relational database. The distinctive feature of the design is its columnar organisation of data, which lends itself to various data-compression techniques and hence better read performance. The paper also presents a non-traditional implementation of read-only transactions: it separates reads and writes, avoids locks on reads, and mainly relies on snapshot isolation and periodic synchronisation.
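
As a toy illustration of why the columnar organisation helps reads (the table and numbers below are invented), a sorted column compresses very well with run-length encoding, and a scan only touches the columns the query actually needs:

    rows = [("US", 100), ("US", 120), ("US", 90), ("EU", 80), ("EU", 95)]

    # A column store keeps each attribute separately (a row store would read whole rows).
    region_col = [r[0] for r in rows]   # ['US', 'US', 'US', 'EU', 'EU']
    amount_col = [r[1] for r in rows]

    def rle(column):
        # Run-length encode a column into [value, run_length] pairs.
        runs = []
        for v in column:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    print(rle(region_col))   # [['US', 3], ['EU', 2]] -- 5 values stored as 2 runs
    print(sum(amount_col))   # an aggregate over amount never touches the region column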

Read More

Paper Review Series : Google

This paper proposes Google, a large-scale hypertextual web search engine. It describes in detail how to crawl and index the Web efficiently, and mechanisms for much higher-precision search results. Google uses a combination of data structures and algorithms to make crawling and indexing efficient in terms of time and space. It introduces "PageRank", a metric computed from backlinks, to prioritise search results. The paper shows performance results and discusses future work such as high-quality search and scalability. Strengths:

  1. The paper identifies the quality issue of search results and addresses it with "PageRank", a metric defined on webpages that orders them by relevance/importance (a toy implementation follows this list), together with anchor-text propagation.
  2. Optimised data structures like BigFiles keep crawling, indexing, and searching cheap.
  3. The distributed architecture, such as multiple crawling servers, is conducive to future scale-up. Weaknesses (i.e. limitations or possible improvements):
  4. Common techniques like query caching and sub-indexing are not mentioned.
  5. The paper does not mention that parsing, indexing, and searching are, or could be, partitioned and distributed, although Google at the time was a prototype.
  6. Manipulation of the system to obtain higher search-result rankings seems to be an important issue and is not sufficiently elaborated.
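
As a rough illustration of the PageRank idea in strength 1 (a generic power-iteration version, not Google's code; the graph and damping factor are arbitrary), rank flows along links until the values settle:

    def pagerank(links, d=0.85, iterations=50):
        # links: dict mapping each page to the list of pages it links to.
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - d) / len(pages) for p in pages}
            for p, outlinks in links.items():
                if not outlinks:                      # dangling page: spread rank evenly
                    for q in pages:
                        new_rank[q] += d * rank[p] / len(pages)
                else:
                    for q in outlinks:
                        new_rank[q] += d * rank[p] / len(outlinks)
            rank = new_rank
        return rank

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(graph))  # C ends up with the highest rank: everything links to it
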
Read More

Roadmap Of "knowing" Neural Network

I say "knowing" because I’m no longer sure what the word "learn" means exactly, epistemologically. I’ve been studying data science for two semesters. It seems the big data hype is over and everybody is talking about deep learning, AI, etc. now, but as a matter of fact, neural networks ARE very powerful at solving certain problems, though in a mysterious way. They will change the landscape of many sectors of the economy for sure. As a CS student or professional, having some deep learning literacy may be rather desirable. Most people are unable to invent new algorithms or models, even with a PhD, but only knowing how to use libs and tune params is not enough. We may need a broad, not too detailed, but working understanding. Here’s a roadmap to acquire that kind of understanding. First and foremost, some grasp of theory is a must. The book by Ian Goodfellow et al. should suffice. If you’ve been off campus for long, a recap of some rudimentary statistics may also be necessary.
Then get your hands dirty. Implement some naive NN, an MLP, a vanilla CNN or RNN without using libs. Some key parts of designing a model, like data flow, the forward and backward passes, and so on, may be worth your additional attention.
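
For example, a minimal from-scratch exercise might look like the following (using only numpy for array math; one hidden layer, learning XOR; all the sizes and the learning rate are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # hidden layer
    W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # output layer
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for step in range(5000):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass: chain rule on the squared error
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        # gradient descent update
        W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
        W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

    print(out.round(3))  # should approach [[0], [1], [1], [0]]
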
Finally you need to work with libs. After all, that’s what the vast majority of AI/deep learning/machine learning engineers do. There are many. Learning PyTorch and then TensorFlow, in that order, is quite orthodox; the learning curve starting with PyTorch is gentler. There are many tutorials online you can follow.

Read More

Set-up Ethereum Development environment

Here’s a brief environment-setup how-to for Ethereum development. It’s based on macOS.

  1. Install Python 2.7.
  2. Install solc, the Solidity compiler, and solc-cli:
    sudo npm install -g solc solc-cli --save-dev
  3. Install ethereum/cpp-ethereum via Homebrew:
    brew tap ethereum/ethereum
    brew install ethereum
  4. Install testrpc (to deploy smart contracts in a local test environment) via pip.
  5. Install Node.js.
  6. Install Truffle (for quickly compiling and deploying smart contracts locally):
    npm install -g truffle
Read More

Thoughts on Functional Programming

Functional programming has come back again in recent years, and not without momentum. Many imperative languages are adding functional capabilities, like Java and Python. Some languages, while proclaiming to be functional, are more procedural, and therefore sometimes confusing. Functional Thinking is a great book for understanding what functional programming really is, especially after you have coded in some functional languages and have first-hand experience.

Once, when upgrading a legacy system, I found that much of the Java code was written in C style: many parameters, no encapsulation, no interface design, and so on. The same happens in the FP scene. Many people, although writing FP code, are thinking in a procedural or imperative way. But the way people think does not come out of thin air; the way people think while programming is heavily influenced by the programming language they write in, and an aptly designed language can help people understand and write better FP code. For example, as a high-level interpreted language, R is not fast, so instead of writing your own functions with loops and procedures, you are compelled to use vectorized functionals as much as possible. Each of them is an independent stage of the pipeline, the logic involved is easier to sort out, and refactoring and testing become easier as well. For the functionals to be independent, side effects are to be avoided as much as possible; that is, the code should be more self-contained. In this way, R helps shape our thinking in a functional way.

The fundamental feature of functional programming is that a function (or functional, or operator; mathematically a set of sets, a second-order object) is treated as a first-class object. Functions can be passed, reused, serialised, and persisted. Mathematically this is much more expressive than imperative languages like Java, which are essentially first-order. It may seem like just a different way of thinking, but it actually reflects a deep understanding of computation, data, and algorithms.
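
A small Python illustration of this last point (the stage names below are invented): because functions are first-class values, a pipeline can be built by passing and composing small, side-effect-free stages instead of writing one big loop with intermediate state:

    from functools import reduce

    def pipeline(*stages):
        # Compose independent, side-effect-free stages into one function.
        return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

    clean   = lambda xs: [x.strip().lower() for x in xs]
    dedupe  = lambda xs: sorted(set(xs))
    lengths = lambda xs: [len(x) for x in xs]

    process = pipeline(clean, dedupe, lengths)       # each stage is testable on its own
    print(process(["  Apple", "banana ", "apple"]))  # [5, 6]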

Read More

Routine Checks on GC

Just fixed a CPU usage burst caused by abnormal GC pauses. It turned out that a single sorting operation on a very large array would very quickly fill up a young-generation heap of the default size. The usual wisdom says don’t tamper with JVM settings unless you know what you are doing, and many are intimidated and follow it. However, settings are there for users to set, aren’t they? Sometimes the problem requires fine-tuning the JVM, and some problems are caused by improper parameters.

Here are some basic things:
First of all, understand the characteristics of your program with regard to object creation.
If you are using frameworks like Spring, make sure you understand the parts you are using.
Take a heap dump and see which objects are triggering the GC.
Consider a possible memory leak.
If the frequent GC is in the young generation, consider increasing the young generation size and adjusting the SurvivorRatio. If it is full GC, you will have to increase the heap.

Read More

Workshop for CQL

CQL stands for Cassandra Query Language. Compared to SQL, its grammar is clearly more limited. This is largely to prevent inefficient queries, because Cassandra is mainly used to store massive amounts of data in distributed systems. Data is hashed against the partition key and then distributed to nodes for storage. Therefore, one key point in designing a Cassandra query is to scan as few nodes as possible, since scanning many nodes can be very inefficient.

We know that a relational database is a set of rows; Cassandra is a set of partitions. Without clustering keys, a partition is simply a single row, and a partition consisting of multiple rows is simply a wide row. It is the hash value of the partition key that Cassandra uses to decide on which node the data should be stored; it is hash indexing, not range indexing. We need clustering keys to order the data within a partition and give it a range, which is what makes range queries possible. An example may be more illustrative.

For the table below, node is the partition key, while (date, number) are the clustering columns.

    CREATE KEYSPACE test WITH REPLICATION = {
        'class': 'SimpleStrategy',
        'replication_factor': 1
    };

    CREATE TABLE log(
        node text,
        date text,
        name text,
        number int,
        PRIMARY KEY (node, date, number)
    );
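
A hedged example of querying this table from Python, assuming the DataStax driver (pip install cassandra-driver), a local node, and made-up values: node (the partition key) and date are fixed with equality and number is given a range, so only a slice of a single partition is scanned:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("test")

    rows = session.execute(
        "SELECT date, number, name FROM log "
        "WHERE node = %s AND date = %s AND number > %s",
        ("node-1", "2018-01-01", 10),
    )
    for row in rows:
        print(row.date, row.number, row.name)
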
Read More

A Comparison Between MongoDB and Cassandra

MongoDB can be a great alternative to MySQL, but it’s not really appropriate for the scale-out applications targeted by Cassandra. Still, as early members of the NoSQL category, the two do draw comparisons.

Read More

Sharing On HBase

Recently I studied some materials on HBase; they are not directly relevant to our work, but good to share anyway.
HBase stands for Hadoop Database. It provides for Hadoop what BigTable provides for the Google File System. It is non-relational, column-oriented (column-family based) and distributed.

An HBase system mainly consists of ZooKeeper, the HMaster and HRegionServers.

Read More

Notes On Java NIO

Java IO is designed to work with streams. Each time, one or more bytes are read from the stream until the end. They are not buffered, and the reader cannot move forward or backward in the stream either; if that is needed, the data from the stream has to be stored somewhere first.
Java NIO, instead, works with buffers. Data is read into a buffer which the user can manipulate later, which gives more flexibility.
For instance, with InputStream or Reader, data from a line-based text file is read sequentially, in this case line by line.

Read More

Some Tips I follow

Years back, when I first came to the realm of programming, I looked for sages and hoped to learn from them. Some tips I found have turned out to be truly valuable over the years. I can’t remember where I got them from, but anyway.

Decompose the problem into pieces of manageable size. This is the essential spirit of software development.

  1. Organize the architecture of your code.

Every package should have 5~20 classes. If there are more than 20, you should consider decomposing the package into sub-packages. Reason: it is easier to manage.

Read More