Day 1: Friday, October 20th, Guizhou Hall

10:00-10:20 Coffee Break
10:20-12:00 Session 3: Regular Papers – Applications and Algorithms
Chair: Cheng Li
Xiangyu Ju, Quan Chen, Zhenning Wang, Minyi Guo and Guang R. Gao
DCF: A Dataflow-based Collaborative Filtering Training Algorithm ( Paper   |  PPT)
Emerging recommender systems often adopt collaborative filtering techniques to improve recommendation accuracy. Existing collaborative filtering techniques are implemented with either the Alternating Least Squares (ALS) algorithm or the Gradient Descent (GD) algorithm. However, neither algorithm is scalable: ALS suffers from high computational complexity, while GD suffers from severe synchronization problems and tremendous data movement. To solve these problems, we propose a Dataflow-based Collaborative Filtering (DCF) algorithm. More specifically, DCF exploits the fine-grained asynchrony of the dataflow model to minimize synchronization overhead, leverages the mini-batch technique to reduce computation and communication complexity, and uses dummy-edge and multicasting techniques to avoid the fine-grained overhead of dependency checking and to reduce data movement. By combining these techniques, DCF significantly improves the performance of collaborative filtering. Our experiments on a cluster with one master node and ten slave nodes show that DCF achieves a 23x speedup over ALS on Spark and an 18x speedup over GD on GraphLab on public datasets.
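The mini-batch gradient-descent core that DCF builds on can be sketched in a few lines; the dataflow scheduling, dummy edges, and multicasting are not shown, and the hyperparameters here are illustrative, not the paper's:

```python
import random

def minibatch_mf(ratings, k=2, lr=0.05, epochs=300, batch=2, seed=0):
    """Factor sparse ratings {(user, item): r} into latent vectors P, Q
    with mini-batch gradient descent."""
    rng = random.Random(seed)
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(k)] for u, _ in ratings}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(k)] for _, i in ratings}
    entries = list(ratings.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for s in range(0, len(entries), batch):      # one mini-batch
            for (u, i), r in entries[s:s + batch]:
                err = r - sum(p * q for p, q in zip(P[u], Q[i]))
                for f in range(k):
                    pu, qi = P[u][f], Q[i][f]        # use pre-update values
                    P[u][f] += lr * err * qi
                    Q[i][f] += lr * err * pu
    return P, Q

ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 1.0}
P, Q = minibatch_mf(ratings)
pred = sum(p * q for p, q in zip(P[0], Q[0]))        # approaches the true 5.0
```

In DCF these per-batch updates become dataflow tasks fired on operand availability rather than iterations of a sequential loop.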
Han Lin, Zhichao Su, Xiandong Meng, Zhong Wang and Hong An
Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive ( Paper   |  PPT)
Metagenomics, the study of all microbial species cohabiting in an environment, often produces large amounts of sequence data, varying from several GBs to a few TBs. Analysing metagenomics data involves several steps; some are data-intensive, while others are compute-intensive. Typical bioinformatics pipelines attempt to analyse the entire data set on computer servers with several terabytes of RAM, which is very inefficient. To overcome this limitation, we propose a MapReduce-based solution to partition the data based on their species of origin. We implemented the solution using BioPig, an analytic toolkit for large-scale genomic sequence data based on Apache Hadoop and Pig. We simplified data types and logic design, compressed k-mer storage, and combined Hadoop with MPI to improve computational performance. After these optimizations, we achieved up to a 193x speedup for the rate-limiting step and an 8x speedup for the entire pipeline. The optimized software can also process datasets that are 16 times larger on the same hardware platform. Results from this case study suggest that the combined Hadoop-MPI approach has great potential for large genomics applications that are both data-intensive and compute-intensive.
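The partitioning step rests on indexing reads by the k-mers they share, so that reads connected through common k-mers can be grouped together. A toy, hypothetical version of that indexing (BioPig's actual Pig operators and the compressed k-mer storage are not reproduced):

```python
def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def index_by_kmer(reads, k=3):
    """Map step: emit (k-mer, read id); reduce step: group read ids.
    A stand-in for partitioning reads by species of origin."""
    index = {}
    for rid, seq in reads.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(rid)
    return index
```

In the real pipeline k is much larger and the index is built as a distributed MapReduce job rather than an in-memory dict.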
Zhenjie Wang, Linpeng Huang and Yanmin Zhu
SCMKV: A Lightweight Log-structured Key-value Store on SCM ( Paper   |  PPT)
Storage Class Memories (SCMs) are promising technologies that could change the future of storage, with many attractive capabilities such as byte addressability, low latency, and persistence. Existing key-value stores designed for block devices use SCMs as block devices, which conceals the performance that SCMs provide. The few existing key-value stores designed for SCMs fail to provide consistency when hardware support such as cache flush on power failure is unavailable. In this paper, we present a key-value store called SCMKV that provides consistency, performance, and scalability. It takes advantage of the characteristics of key-value workloads and leverages the log-structured technique for high throughput. In particular, we propose a static, concurrent, cache-friendly hash table scheme to accelerate accesses to key-value objects, and we maintain separate data logs and memory allocators for each worker thread for high concurrency. To reduce write latency, SCMKV reduces both the number of writes to SCMs and the number of cache-flush instructions. Our experiments show that SCMKV achieves much higher throughput and better scalability than state-of-the-art key-value systems.
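The log-structured write path can be illustrated in a few lines; a plain dict stands in for SCMKV's concurrent hash table, and persistence, per-thread logs, and cache flushes are omitted:

```python
class LogKV:
    """Append-only log plus an index from key to log offset."""
    def __init__(self):
        self.log = []      # data log (SCMKV keeps one per worker thread)
        self.index = {}    # hash table: key -> offset of the newest record
    def put(self, key, value):
        off = len(self.log)
        self.log.append((key, value))  # append first, for crash consistency
        self.index[key] = off          # then publish the new offset
    def get(self, key):
        off = self.index.get(key)
        return None if off is None else self.log[off][1]

kv = LogKV()
kv.put("a", 1)
kv.put("a", 2)   # an overwrite appends; the stale record stays in the log
```

Publishing the index entry only after the record is appended is what lets a log-structured store on SCM recover to a consistent state without special flush-on-power-failure hardware.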
Zhen Guo, Wenbin Yao and Dongbin Wang
A Virtual Machine Migration Algorithm Based on Group Selection in Cloud Data Center ( Paper   |  PPT)
Live migration of virtual machines (VMs) is a promising technology that helps physical machines (PMs) adapt to load changes and guarantees Quality of Service (QoS) in cloud data centers. Many studies of individual VM migration ignore the associations between VMs, resulting in high communication cost. Some research on multiple-VM migration migrates the VM group as a whole, which is likely to result in ineffective migration and increase the network burden. In this paper, a VM migration algorithm based on group selection (VMMAGS) is proposed, which takes the migration cost, communication cost, and VM heat into account to optimize migration performance. Appropriate VM groups are selected as migration options, and the optimal migration scheme is obtained according to the integration cost of partitions of the selected VM groups. Extensive experiments show that, compared with related algorithms, our algorithm effectively reduces migration and communication costs and improves system reliability.
Baolin Sun, Chao Gui, Ying Song and Hua Chen
Adaptive Length Sliding Window-based Network Coding for Energy Efficient Algorithm in MANETs ( Paper   |  PPT)
A key problem in network coding (NC) lies in the complexity and energy consumption associated with the packet decoding process, which hinders its application in mobile ad hoc networks (MANETs). Sliding-window network coding is a variation of NC that augments TCP/IP and improves the throughput of TCP on MANETs. In this paper, we propose an Adaptive Length Sliding Window-based Network Coding for Energy Efficiency algorithm for MANETs (ALSW-NCEE). The performance of ALSW-NCEE is studied using NS2 and evaluated in terms of throughput, packet loss probability, and decoding delay during packet transmission. The simulation results show that ALSW-NCEE significantly improves network throughput and achieves a higher diversity order.
12:25-14:00 Lunch Break
14:00-18:00 Session 4: Panel I – Dataflow and Big Data: from Academic Research to Industry Practice (1)
Chair: Hong An
Bengbu Hall
Guang R. Gao, University of Delaware, USA
Resurgence of Dataflow Models – A Time of Reflection ( Paper   |  PPT)
Aaron Smith, Microsoft, USA
From Datacenters to Client Devices – How Microsoft is Preparing for the End of Moore's Law ( Paper   |  PPT)
In this talk I will discuss two projects at Microsoft that deal with the end of Moore's Law and silicon scaling. Project Catapult uses reconfigurable computing to accelerate datacenter services such as Bing search and Azure networking in Microsoft datacenters. Project E2 is a next-generation Explicit Data Graph Execution (EDGE) architecture that utilizes a hybrid von Neumann/dataflow model to overcome the limitations of traditional CISC/RISC instruction set architectures.
Greg Wright, Qualcomm, USA
Practical dataflow – von Neumann hybrid architecture ( Paper   |  PPT)
In this short talk, we will consider the relative strengths and weaknesses of the dataflow and conventional von Neumann models, and how they can be combined to obtain the best features of each for high performance and power efficiency, while maintaining compatibility with existing high level software stacks and programming models.
Dongrui Fan, Institute of Computing Technology, China
SmarCo – Designing a High-end Processor with the Dataflow Execution Model ( Paper   |  PPT)
Dataflow architectures are playing an increasingly important role in high-end computing. In this short talk, we will present a feasible design method for dataflow processors and our new dataflow processor, SmarCo, which demonstrates the outstanding efficiency of the dataflow execution model together with high performance.
15:25-16:00 Coffee Break & Group Photo
16:00-18:00 Session 5: Panel II – Dataflow and Big Data: from Academic Research to Industry Practice (2)
Chair: Guang R. Gao
Bengbu Hall
Michael Gschwind, IBM Corp, USA
High Performance Computing is the foundation for creating value from Big Data with AI ( Paper   |  PPT)
The emergence of Deep Artificial Neural Networks (DNNs) is revolutionizing information technology with an emphasis on extracting information from massive data corpora. Deep Learning is the process of training a DNN and is a highly numerically intensive operation with an emphasis on a small number of computational kernels that are well known in the high-performance computing community, such as generalized matrix-matrix multiplication and other dense stencil computations. In 2016, IBM introduced the new S822LC for HPC server, designed to deliver unprecedented performance for both Artificial Intelligence and traditional High-Performance Computing (HPC) workloads. With its high-performance NVLink connection, the S822LC for HPC server offers a sweet spot of scalability, performance, and efficiency for Deep Learning applications. The next-generation S822LC for HPC systems combine the balanced high-performance Power server design with four high-performance P100 GPUs, which exploit dataflow principles to maximize throughput by scheduling groups of computational threads based on operand availability to hide latency and deliver peak performance. The GPUs are connected via NVLink for enhanced peer-to-peer GPU multiprocessing, and via CPU-GPU NVLink for enhanced performance and programmability.
Kemal Ebcioglu, Global Supercomputer Inc., USA
Toward cloud center architectures for achieving performance in the limit ( Paper   |  PPT)
The dataflow computing model introduced an elegant and highly influential parallel alternative to the von Neumann sequential computing model. But the dataflow approach introduced new programming models which are fundamentally incompatible with von Neumann's sequential single-threaded instruction execution. Furthermore, there is a misconception that in the sequential von Neumann computing model (much criticized by the dataflow community) instructions have to be executed sequentially. While there exist proofs of non-achievability of optimal theoretical performance in corner cases, in reality nothing prevents a modern von Neumann computer from executing an application within a time period that is little more than the critical path length of the entire execution trace of the application plus speed-of-light delays, while remaining fully compatible with the von Neumann sequential execution model. The exascale, very large cloud centers of the near future, comprising millions of FPGAs and ASICs, will provide the infrastructure for enabling such performance, using application-specific, customized hardware compiled from a sequential application.
Hong An, University of Science and Technology of China 
A Dataflow-based Runtime Support Implementation on a 100P Actual System ( Paper   |  PPT)
The high availability of a 100P-scale or larger production system such as Sunway TaihuLight for computational science and engineering applications remains very attractive to the supercomputing community, since obtaining peak performance on irregular applications such as computer algebra problems remains a challenging problem. In this short talk, we will introduce our preliminary work on a dataflow-based runtime support implementation on Sunway TaihuLight, which exploits the computational resources of a 100P-scale system with great efficiency.
Guang R. Gao, University of Delaware, USA
Moderator Remarks ( Paper   |  PPT)
Open to all attendees
Q & A ( Paper   |  PPT)
Day 2: Saturday, October 21st, Guizhou Hall

10:00-10:20 Coffee Break
10:20-12:00 Session 7: Architectures and Systems for AI
Chair: Zujie Ren
Sijiang Fan, Jiawei Fei and Li Shen
Accelerating Deep Learning with a Parallel Mechanism Using CPU+MIC ( Paper   |  PPT)
Deep neural networks (DNNs) are one of the most popular machine learning methods and are widely used in many modern applications. Training DNNs is time-consuming, and accelerating it has been the focus of much research. In this study, we speed up the training of DNNs for automatic speech recognition on a heterogeneous (CPU+MIC) architecture. We apply asynchronous methods for I/O and communication operations and propose an adaptive load-balancing method. We also employ momentum to speed up the convergence of the gradient descent algorithm. Results show that our optimized algorithm achieves a 20-fold speedup on a CPU+MIC platform compared with the original sequential algorithm on a single-core CPU.
Chengfan Jia, Junnan Liu, Xu Jin, Han Lin, Hong An, Wenting Han, Zheng Wu, Mongxian Chi
Improving the performance of distributed TensorFlow with RDMA ( Paper   |  PPT)
TensorFlow is an open-source software library designed for Deep Learning using dataflow graph computation. Thanks to the flexible architecture of TensorFlow, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. A distributed TensorFlow job uses gRPC to connect different nodes. However, when training tasks are deployed on high-performance computing clusters, the performance of gRPC becomes a bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to a traditional TCP/IP network, but open-source TensorFlow does not take advantage of this. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive parts of TensorFlow to RDMA verbs, we obtain nearly 6x performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also scales well as the training scale grows.
Yuntao Lu, Lei Gong, Chao Wang and Xuehai Zhou
An Efficient Accelerating Method for Large-scale Sparse Neural Networks ( Paper   |  PPT)
Neural networks have been widely used as a powerful representation in many research domains, such as Computer Vision, Natural Language Processing, and Artificial Intelligence. State-of-the-art research tends to increase the number of neurons and synapses, which makes networks both computationally and memory intensive and difficult to deploy on resource-limited platforms. Sparse methods have been proposed to reduce redundant synapses of neural networks, but conventional accelerators cannot benefit from the sparsity. In this paper, we propose an efficient accelerating method for sparse neural networks, which compresses sparse synapse weights and processes the compressed structure with an FPGA accelerator. Our compression method achieves compression ratios of 50% and 10% for synapse weights in convolutional and fully connected layers, respectively. The experimental results demonstrate that our accelerating method can boost an FPGA accelerator to achieve a 3x speedup over a conventional one.
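One common way to compress sparse synapse weights, which may or may not match the paper's exact format, is compressed sparse row (CSR); the accelerator then iterates only over non-zero entries:

```python
def to_csr(dense):
    """Compress a dense weight matrix, keeping only non-zero synapses."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))   # end of this row in `values`
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product, the core operation a sparse
    accelerator performs; zeros are never touched."""
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(len(row_ptr) - 1)]
```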
Fan Sun, Chao Wang, Lei Gong, Yiwei Zhang, Xi Li and Xuehai Zhou
A Pipeline Energy-efficient Accelerator for Convolutional Neural Networks ( Paper   |  PPT)
Convolutional neural networks (CNNs) have been widely applied to image recognition, face detection, and video analysis because of their ability to achieve accuracy close to or even better than human-level perception. However, for large-scale CNNs, the computation-intensive convolution layers and memory-intensive fully connected layers bring many challenges to the implementation of CNNs. To overcome this problem, this work proposes a pipelined, energy-efficient accelerator for convolutional neural networks, applying different methods to optimize the convolution layers and the fully connected layers. For the convolution layers, the accelerator first rearranges the input features into a matrix on-the-fly when storing them to the FPGA on-chip buffers, so that the computation of a convolution layer can be completed through matrix multiplication. For the fully connected layers, a batch-based method is used to reduce the required memory bandwidth; these layers can also be completed through matrix multiplication. A two-layer pipelined computation method for matrix multiplication is then proposed to increase throughput and reduce the buffer requirement. The experimental results show that the proposed accelerator reduces the energy consumption of a CPU and a GPU by 333.48x and 17.18x respectively, demonstrating that it outperforms both in energy efficiency. The proposed accelerator runs at a frequency of 100 MHz and achieves a throughput of 49.31 GFLOPS using only 198 DSP48 modules, a significant resource-utilization improvement compared with the state-of-the-art.
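The on-the-fly rearrangement of input features into a matrix is in the spirit of the classic im2col transformation; a minimal sketch for a single-channel 2-D input (the accelerator's actual buffering and pipelining are not modeled):

```python
def im2col(img, kh, kw):
    """Flatten each kh x kw window of a 2-D input into one matrix row,
    so that convolution becomes a matrix multiplication."""
    h, w = len(img), len(img[0])
    return [[img[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(h - kh + 1)
            for j in range(w - kw + 1)]

def conv2d_as_matmul(img, kernel):
    """Convolution realised as (im2col matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    return [sum(a * b for a, b in zip(row, flat_k))
            for row in im2col(img, kh, kw)]
```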
Yuran Qiao, Junzhong Shen, Dafei Huang, Qianming Yang, Mei Wen and Chunyuan Zhang
Optimizing OpenCL Implementation of Deep Convolutional Neural Network on FPGA ( Paper   |  PPT)
Nowadays, the rapid growth of data across the Internet has provided sufficient labeled data to train deep structured artificial neural networks. While deeper structured networks bring significant precision gains in many applications, they also pose an urgent demand for higher computation capacity at the expense of power consumption. To this end, various FPGA-based deep neural network accelerators have been proposed for higher performance and lower energy consumption. However, as a dilemma, the development cycle of an FPGA application is much longer than that of a CPU or GPU application. Although FPGA vendors such as Altera and Xilinx have released OpenCL frameworks to ease programming, tuning OpenCL code for desirable performance on FPGAs is still challenging. In this paper, we look into the OpenCL implementation of Convolutional Neural Networks (CNNs) on FPGA. By analysing the execution behaviour of a CPU/GPU-oriented version on FPGA, we identify the causes of the performance difference between FPGA and CPU/GPU and locate the performance bottlenecks. Based on our analysis, we put forward a corresponding optimization method focusing on external memory transfers. We implement a prototype system on an Altera Stratix V A7 FPGA, which brings a considerable 4.76x speedup over the original version. To the best of our knowledge, this implementation outperforms most previous OpenCL implementations on FPGA by a large margin.
12:25-13:00 Lunch Break
13:00-14:00 Session 8: Short & Poster Papers
Chair: Xuanhua Shi
Jianhui Lv, Xingwei Wang and Min Huang
ACO-inspired ICN Routing Scheme with Density-based Spatial Clustering ( Paper   |  PPT)
Information-Centric Networking (ICN) routing faces challenges such as efficient content retrieval, self-organized forwarding, and good scalability, and current proposals cannot solve these problems effectively. This paper proposes a new Ant Colony Optimization (ACO)-inspired ICN routing scheme with Density-based Spatial Clustering (DSC). First, a content concentration model is established with network load to address interest routing; in particular, we investigate the failure cases of content retrieval. Then, the dot-product method is used to compute the similarity between two routers, which is used as the clustering reference attribute. In addition, DSC is exploited to detect core nodes, which are used to cache contents during the data routing process. Finally, the proposed scheme is simulated over MiniNDN, and the results show that it outperforms the benchmark scheme in terms of routing success rate, routing hop count, load balance degree, execution time, and throughput.
Zhenxue He, Limin Xiao, Fei Gu, Zhisheng Huo, Li Ruan, Mingfa Zhu, Longbing Zhang and Xiang Wang
An Efficient Polarity Optimization Approach for Fixed Polarity Reed-Muller Logic Circuits Based on Novel Binary Differential Evolution Algorithm ( Paper   |  PPT)
The bottleneck of integrated circuit design could potentially be alleviated by using Reed-Muller (RM) logic circuits due to their remarkable superiority in power, area, and testability. In this paper, we propose a Novel Binary Differential Evolution (DE) algorithm (NBDE) to solve the discrete binary-encoded combinatorial optimization problem, which achieves a better tradeoff between exploration and exploitation capabilities by using a binary random mutation operator. Moreover, based on the NBDE, we propose an Efficient Polarity Optimization Approach (EPOA) for Fixed Polarity RM (FPRM) logic circuits, which uses the NBDE to search for the best polarity under a performance constraint. To the best of our knowledge, we are the first to use DE to optimize RM circuits. The experimental results on 24 MCNC benchmark circuits show the effectiveness and superiority of EPOA, and confirm NBDE as a promising tool for optimizing RM circuits.
Chang Zhao, Yusen Wu, Zujie Ren, Weisong Shi, Yongjian Ren and Jian Wan
Quantifying the Isolation Characteristics in Container Environments ( Paper   |  PPT)
In recent years, container technologies have attracted intensive attention due to their light weight and easy portability. Performance isolation between containers is becoming a significant challenge, especially in terms of network throughput and disk I/O. Compared with virtual machines (VMs), containers suffer from worse isolation because they share not only physical resources but also the OS kernel. A couple of solutions have been proposed to enhance the performance isolation between containers; however, there is a lack of study on how to quantify it. In traditional VM environments, performance isolation is often calculated based on the performance loss ratio. In container environments, the performance loss of well-behaved containers may be incurred not only by misbehaving containers but also by container orchestration and management; therefore, measurement models that consider only performance loss are not accurate enough. In this paper, we propose a novel performance isolation measurement model that combines the performance loss and resource shrinkage of containers. We conducted a group of performance evaluation experiments based on the open-source container project Docker. Experimental results validate the effectiveness of our proposed model, and highlight that performance isolation between containers differs from the corresponding issue in VM environments.
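As a loose illustration of the idea (the paper's actual formula is not reproduced; the combination below is an assumption made for this sketch), one can discount from the observed performance loss the part already explained by resource shrinkage:

```python
def isolation_score(base_perf, observed_perf, base_res, observed_res):
    """Performance loss beyond what resource shrinkage explains.
    Illustrative only; not the model as published."""
    perf_loss = (base_perf - observed_perf) / base_perf    # fractional loss
    res_shrink = (base_res - observed_res) / base_res      # fractional shrink
    return max(0.0, perf_loss - res_shrink)
```

Under this toy metric, a container that loses 20% performance after its resources were shrunk by 20% scores 0 (no isolation violation), while the same loss with unchanged resources scores 0.2.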
Bo Wang, Jie Tang, Rui Zhang and Zhimin Gu
CSAS: Cost-based Storage Auto-Selection, A Fine Grained Storage Selection Mechanism for Spark ( Paper   |  PPT)
Spark provides the RDD data abstraction to improve large-scale data processing capabilities. To improve system performance, Spark places RDDs in memory for further access through its caching mechanism. By default, RDDs can only be cached in memory, which restricts the usage of Spark: when memory is insufficient, jobs may fail. Therefore, Spark provides a variety of storage levels to meet different storage scenarios. However, the RDD-grained manual storage-level selection mechanism cannot take full advantage of the system's computing resources. In this paper, we first present a fine-grained automatic storage-level selection mechanism. We then select a storage level for each partition based on a cost model that fully considers system resource status and compression and serialization costs. Experiments show that our approach offers up to a 77% performance improvement over the default storage-level scheme provided by Spark.
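A cost-based selection can be sketched as follows; the storage-level names follow Spark, but the cost weights and the selection rule here are invented for illustration:

```python
# Illustrative per-level costs: (needs_memory, serialize, deserialize, io).
# The real CSAS model would derive these from measured resource status.
STORAGE_LEVELS = {
    "MEMORY_ONLY":     (True,  0.0, 0.0, 0.0),
    "MEMORY_ONLY_SER": (True,  1.0, 1.0, 0.0),
    "DISK_ONLY":       (False, 1.0, 1.0, 5.0),
}

def select_level(partition_size, free_memory):
    """Pick the cheapest storage level whose memory requirement is met,
    per partition rather than per RDD."""
    best, best_cost = None, float("inf")
    for level, (needs_mem, ser, deser, io) in STORAGE_LEVELS.items():
        if needs_mem and partition_size > free_memory:
            continue  # this partition cannot be cached in memory
        cost = ser + deser + io
        if cost < best_cost:
            best, best_cost = level, cost
    return best
```

The point of the fine granularity is visible even in this toy: two partitions of the same RDD can land on different levels depending on free memory at the time.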
Jie Zhang and Myoungsoo Jung
An In-depth Performance Analysis of Many-Integrated Core for Communication Efficient Heterogeneous Computing ( Paper   |  PPT)
Many-integrated core (MIC) architecture combines dozens of reduced x86 cores onto a single chip, which can execute diverse user applications on existing programming frameworks without major modification. By employing many general-purpose cores, MICs offer high degrees of parallelism while operating with the same instruction set that the main processor uses. These generality and flexibility characteristics of MICs lower training barriers and allow programmers to focus more on problems instead of system engineering. However, parallel applications executed across the many cores of one or more MICs require a series of work related to data sharing and synchronization, which may make it difficult for users to exploit the full advantage of the massive parallelism that MICs deliver. In this work, we build a real CPU+MIC heterogeneous cluster, which consists of eight main processor cores and 244 physical MIC cores (61 cores per MIC device). We analyze the performance behaviors of the heterogeneous cluster by examining different communication methods, such as message passing and remote direct memory access. Our evaluation results and in-depth studies reveal that i) aggregating small messages can improve network bandwidth without violating latency restrictions, ii) although MICs expose hundreds of hardware cores, the highest network throughput is achieved when only 4-6 point-to-point connections are established for data communication, and iii) data communication over multiple point-to-point connections between the host and MICs introduces severe load imbalance, which needs to be optimized for future heterogeneous computing.
Jie Wu, Binzhang Fu and Mingyu Chen
Stem: A Table-based Congestion Control Framework for Virtualized Data Center Networks ( Paper   |  PPT)
Congestion control is one of the biggest challenges faced by networks, and the challenge is magnified in current data centers due to their large scale and variety of applications. Generally, different kinds of applications prefer different congestion control solutions. However, current mechanisms often exploit customized frameworks and require dedicated modules to realize certain functions, making it almost impossible to deploy multiple solutions at the same time or to load another solution when a new application is served. To address this problem, this paper proposes Stem, a solely table-based congestion control framework that is compatible with most current congestion control solutions. To this end, this paper proposes a new OpenFlow action, namely the EXEC FUNC action, to let network operators define tenant-specific customized control logic. Furthermore, a DCTCP-like congestion control solution is proposed and implemented with the proposed Stem architecture, and a new OpenFlow group, namely the specific-match group, is proposed to simplify the collection of stateful statistics. Finally, a new algorithm to increase flow rate is presented to improve the stability of DCTCP. We implement the prototype with Open vSwitch, and the experimental results show that Stem achieves the above benefits with negligible overhead.
Tian Yang, Kenjiro Taura and Liu Chao
SDAC: Porting Scientific Data to RDDs ( Paper   |  PPT)
Scientific data processing has exposed a range of technical problems in industrial exploration and domain-specific applications due to its huge input volume and data-format diversity. Big Data analytics frameworks such as Hadoop and Spark lack native support for processing increasingly heterogeneous scientific data efficiently, as hierarchical structures and exclusive indexing methods raise difficulties for parallel I/O and unified processing. In this paper, we introduce our recent work, SDAC (Scientific Data Auto Chunk), for porting various scientific data formats to RDDs to support parallel processing and analytics in the Apache Spark framework. By combining the MapReduce paradigm with our auto-chunk task-granularity method, a better-planned pipeline with an improved level of parallelism can be derived to guide scientific data partitioning and parallel operations. We present a full performance comparison with H5Spark, a mature module atop Spark, on six benchmarks including Genetic, K-means, and Logistic Regression in both standalone and distributed modes. Experimental results show that with the SDAC module we achieve an overall improvement of 2.1x over H5Spark in standalone mode and 1.34x in distributed mode.
Xuefei Wang, Ruohang Feng, Wei Dong, Xiaoqian Zhu and Wenke Wang
Unified Access Layer with PostgreSQL FDW for Heterogeneous Databases ( Paper   |  PPT)
Large-scale application systems usually consist of various databases for different purposes. However, the increasing use of different databases, especially NoSQL databases, makes it increasingly challenging to use and maintain such systems. In this paper, we demonstrate a framework for designing a foreign data wrapper (FDW) for external data sources. We propose a novel method to access heterogeneous databases, including SQL and NoSQL databases, by using a unified access layer. This method was applied in some real business applications of Alibaba, in which we were able to do various operations on Redis, MongoDB, HBase, and MySQL by using a simple SQL statement. In addition, the information exchange and data migration between these databases can be done by using unified SQL statements. The experiments show that our method can maintain good database performance and provide users with a lot more convenience and efficiency.
Bingyu Zhou, Limin Xiao, Zhisheng Huo, Bing Wei, Jiulong Chang, Nan Zhou and Li Ruan
Balancing Global and Local Fairness Allocation Model in Heterogeneous Data Centers ( Paper   |  PPT)
To address the problem that bandwidth allocation methods in heterogeneous data centers cannot take both global and local fairness into account, we propose an allocation method called GLA that satisfies fairness attributes and can improve utilization by nearly 20% when there are more than six clients. We prove the algorithm is effective through theoretical derivation and experimental verification. It is also portable and scalable, adapting to most systems. Finally, we objectively analyze the merits and demerits of the algorithm and outline the next steps.
Jian Li and Jian Cao
System Problem Detection by Mining Process Model from Console Logs ( Paper   |  PPT)
Console log messages play an important role in the maintenance of modern large-scale distributed systems. Given the explosive growth of large-scale services, manually detecting problems from console logs is infeasible; automated log-based anomaly detection techniques are therefore badly needed to monitor large and complex distributed systems. In this study, we propose a novel process mining algorithm to discover a process model from console logs, and further use the obtained process model to detect anomalies. In brief, the console logs are first parsed into events, and the events from each session are grouped into event sequences. Then, a process model is mined from the event sequences to describe the main system behaviors. Finally, we use the process model to detect anomalous log information. Experiments on the Hadoop File System log dataset show that this approach can detect anomalies from log messages with high accuracy and few false positives. Compared with previously proposed automatic anomaly detection methods, our approach provides intuitive and meaningful explanations to human operators as well as identifying real problems accurately. Furthermore, the process model is easy to understand, so engineers can tune it based on their knowledge of the system design and implementation to further improve the effectiveness of anomaly detection.
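The first two steps (parsing raw lines into event types and grouping them by session) can be sketched as follows; the log format and the regular expression are invented for illustration:

```python
import re

# Toy log line: "[<session id>] <message>"; numbers in the message are
# masked so that lines differing only in ids map to the same event type.
LINE_RE = re.compile(r"\[(?P<session>\w+)\]\s+(?P<msg>.*)")

def parse_event(line):
    m = LINE_RE.match(line)
    return m.group("session"), re.sub(r"\d+", "*", m.group("msg"))

def sessionize(lines):
    """Group abstracted events into per-session event sequences,
    the input from which a process model would then be mined."""
    sessions = {}
    for line in lines:
        sid, event = parse_event(line)
        sessions.setdefault(sid, []).append(event)
    return sessions
```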
14:00-16:05 Session 9: Regular Papers – Architectures and Systems Design and Optimization
Chair: Xuanhua Shi
Wenjie Liu, Sheng Ma, Libo Huang and Zhiying Wang
The Design of NoC-side Memory Access Scheduling for Energy-Efficient GPGPUs ( Paper   |  PPT)
Memory access scheduling schemes, often performed in memory controllers, have a marked impact on alleviating the heavy burden placed on the memory systems of GPGPUs. Existing out-of-order scheduling schemes, like FR-FCFS, improve memory access efficiency by reordering memory request sequences at the destination. Their effectiveness, however, comes at the expense of complex logic and high power consumption. In this paper, we propose NoC-side memory access scheduling based on the key insight that the transmission behavior of on-chip networks is the dominant factor in destroying row access locality and causing poor memory access efficiency. With appropriate NoC-side optimization, straightforward in-order scheduling can be used in memory controllers to simplify scheduling logic and alleviate the tight power envelope. Moreover, we introduce several lightweight optimizations to further improve system performance. Experimental results on memory-intensive applications show that, compared with FR-FCFS, our proposed scheme increases overall system performance by 10.5%, reduces power consumption by 20% and improves energy efficiency by 36.9%.
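The row-access-locality argument can be illustrated with a toy model (an illustrative sketch, not the paper's NoC-side mechanism; the row names, request queue, and hit-counting loop are all hypothetical simplifications): once network transmission interleaves requests from different DRAM rows, out-of-order FR-FCFS recovers row-buffer hits that plain in-order scheduling loses.

```python
def frfcfs_pick(queue, open_row):
    """FR-FCFS: serve the oldest request that hits the currently open row;
    if none hits, fall back to the oldest request (FCFS)."""
    for i, (row, _) in enumerate(queue):
        if row == open_row:
            return queue.pop(i)
    return queue.pop(0)

def in_order_pick(queue, open_row):
    """Plain in-order scheduling: always serve the oldest request."""
    return queue.pop(0)

def service(queue, policy):
    """Count row-buffer hits under a policy; each served request leaves
    its row open for the next one."""
    open_row, hits, q = None, 0, list(queue)
    while q:
        row, _ = policy(q, open_row)
        hits += (row == open_row)
        open_row = row
    return hits

# Requests to rows A and B arrive interleaved, as NoC transmission might mix them.
reqs = [("A", 0), ("B", 1), ("A", 2), ("B", 3), ("A", 4), ("B", 5)]
print(service(reqs, frfcfs_pick))    # 4 hits: served as A,A,A then B,B,B
print(service(reqs, in_order_pick))  # 0 hits: rows alternate on every access
```

The paper's point is the converse: if NoC-side scheduling keeps same-row requests together in the first place, the cheap in-order policy no longer pays this locality penalty.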
Yang Shi, Yanmin Zhu and Linpeng Huang
Partial-PreSET: Enhancing Lifetime of PCM-Based Main Memory with Fine-Grained SET Operations ( Paper   |  PPT)
Phase change memory (PCM) is one of the promising candidates to replace DRAM, with attractive features such as zero leakage power and high scalability. In PCM, a SET operation needs much more time than a RESET operation. A typical write request concurrently writes 64 bytes to the PCM memory line, so write latency is determined by the SET operation. PreSET was proposed to improve PCM performance by exploiting this asymmetry in write time: a PreSET operation proactively SETs all the bits in the memory line before a dirty cache line is written to PCM memory. Later, when a write request is processed, only RESET operations are performed. Consequently, PreSET shortens write latency, which improves system performance. However, PreSET operates at a coarse-grained level, which reduces the endurance of PCM. Through an empirical study we find that in most applications the number of dirty words in a dirty line is actually quite limited. If we SET only those dirty words, instead of the whole cache line, we can significantly extend the lifetime of PCM while still achieving desirable system performance. Inspired by this observation, we propose Partial-PreSET, which balances the performance and endurance of a PCM system. The core idea of this scheme is to SET the dirty part of a cache line in a fine-grained fashion. Our experiments show that the proposed Partial-PreSET scheme significantly improves the average lifetime of a PCM system, by up to 2.79x, while incurring only 2% system performance loss, compared with the state-of-the-art scheme, i.e., PreSET.
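The fine-grained idea can be sketched as follows (a simplified illustration assuming 8-byte words in a 64-byte line; the interface is hypothetical, not the paper's hardware design): compare the old and new contents of a line and mark for PreSET only the words that actually differ.

```python
WORD_BYTES = 8
LINE_BYTES = 64

def dirty_words(old_line, new_line, word_bytes=WORD_BYTES):
    """Return indices of words that changed between the old and new line
    contents; under a Partial-PreSET-style policy, only these words would
    be proactively SET, sparing the clean words' endurance."""
    assert len(old_line) == len(new_line) == LINE_BYTES
    dirty = []
    for w in range(LINE_BYTES // word_bytes):
        lo, hi = w * word_bytes, (w + 1) * word_bytes
        if old_line[lo:hi] != new_line[lo:hi]:
            dirty.append(w)
    return dirty

old = bytearray(LINE_BYTES)
new = bytearray(LINE_BYTES)
new[0] = 0xFF    # falls in word 0
new[40] = 0x01   # falls in word 5
print(dirty_words(bytes(old), bytes(new)))  # [0, 5]
```

If only 2 of 8 words are dirty, a whole-line PreSET wastes SET cycles on 6 clean words, which is the endurance cost the paper's empirical study quantifies.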
Zhiwen Chen, Xin He, Jianhua Sun and Hao Chen
Have Your Cake and Eat it (too): A Concurrent Hash Table with Hardware Transactions ( Paper   |  PPT)
Hardware Transactional Memory (HTM) opens a new way to scale multi-core software. Its main target is to achieve high performance on multi-core systems while simplifying concurrency control and guaranteeing correctness. This paper redesigns an existing concurrent hash table using a number of HTM-based synchronization mechanisms. Compared with a fine-grained lock implementation, HTM-based locking scales well on our testing platform, and its performance is higher when running large-scale workloads. In addition, HTM-based global locking consumes much less memory. In summary, several observations are made in this paper with detailed experimental analysis, which may have important implications for future research on concurrent data structures and HTM.
Jian Gao, Kang Yu, Peng Qing and Hongmei Wei
MPFL: A Distributed Fault Localization Framework for High-Performance Computing Systems ( Paper   |  PPT)
Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as HPC systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, namely MPFL, which provides a light-weight distributed service using tree-based fault detection (TFD) and tree-based fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries, enabling system middleware such as job schedulers to obtain abnormality information. We present the details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a fault localization engine prototype within MVAPICH2. The experimental evaluation is performed on a typical HPC cluster with 10 computing nodes; the results demonstrate the capability of MPFL and show that the MPFL service does not affect the performance of an application in practice.
Donghyun Gouk, Jie Zhang and Myoungsoo Jung
Enabling Realistic Logical Device Interface and Driver for NVM Express Enabled Full System Simulations ( Paper   |  PPT)
Data volumes are exploding: over the past 10 years, more information has been created than the total storage capacity across all media types. While storage systems play a critical role in the modern memory hierarchy, their interfaces and simulation models are overly simplified in computer-system architecture research. Specifically, gem5, a popular full system simulator, includes only the Integrated Drive Electronics (IDE) interface, which was originally designed three decades ago, and simulates the underlying storage device with a constant latency value. In this work, we implement an NVMe disk and controller to enable a realistic storage stack for next-generation interfaces, and integrate them into gem5 together with a high-fidelity solid state disk simulation model. We verify the functionality of our NVMe implementation using a standard user-level tool, the NVMe command line interface. Our evaluation results reveal that the performance of a high performance SSD can vary significantly with the software stack and storage controller, even under the same device configurations and degrees of parallelism. Specifically, the traditional interface caps the performance of the SSD by 85%, whereas the NVMe interface we implemented in gem5 can successfully expose the true performance aggregated across many underlying Flash media.
Tan Zhang, Chaobing Zhou, Libo Huang, Nong Xiao and Sheng Ma
Improving Branch Prediction for Thread Migration on Multi-Core Architectures ( Paper   |  PPT)
Thread migration is ubiquitous in multi-core architectures. When a thread migrates to an idle core, the branch information of the branch predictor on the idle core is absent, which leaves the branch predictor with comparatively low prediction accuracy until warm-up finishes. During the warm-up period, the branch predictor spends quite a lot of time recording branch information and often makes mispredictions. These are the main performance costs of thread migration. In this paper, we point out that, when a thread migrates to an idle core, prediction accuracy can be improved by migrating branch history information from the source core to the target. Moreover, several migration strategies are introduced to fully exploit the performance of branch prediction migration. Experimental results show that, compared to a baseline which does not migrate any branch history information, branch prediction migration reduces the MPKI of the branch predictor on the new core by 43.46% on average.
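The effect of migrating branch history can be illustrated with a toy predictor (a hypothetical sketch using a simple 2-bit bimodal predictor, not the predictor or migration strategies studied in the paper): after migration, a core seeded with the source core's counter table predicts a warmed-up branch correctly, while a cold core does not.

```python
class BimodalPredictor:
    """Toy 2-bit saturating-counter branch predictor, one counter per
    table slot, indexed by branch PC."""
    def __init__(self, slots=16, table=None):
        self.slots = slots
        self.table = dict(table) if table else {}  # slot -> counter (0..3)

    def predict(self, pc):
        # Predict taken when the counter is in a "taken" state (2 or 3);
        # unseen branches default to weakly not-taken (1).
        return self.table.get(pc % self.slots, 1) >= 2

    def update(self, pc, taken):
        c = self.table.get(pc % self.slots, 1)
        self.table[pc % self.slots] = min(3, c + 1) if taken else max(0, c - 1)

# Warm up the predictor on the source core: the branch at PC 0x40 is always taken.
src = BimodalPredictor()
for _ in range(4):
    src.update(0x40, True)

cold = BimodalPredictor()                  # idle core with empty history
warm = BimodalPredictor(table=src.table)   # history migrated along with the thread
print(cold.predict(0x40), warm.predict(0x40))  # False True
```

The cold core must re-learn the branch (and mispredict meanwhile), which is exactly the warm-up cost the paper's migration strategies aim to remove.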
16:05-18:10 Session 10   Regular Paper: Software Environments and Tools
Chair: Yu Zhang
Weiqi Dai, Yukun Du, Weizhong Qiang, Deqing Zou, Shouhuai Xu, Zhongze Liu and Hai Jin
RollSec: Automatically Secure Software State against General Rollback ( Paper   |  PPT)
The rollback mechanism is critical in crash recovery and debugging, but its security problems have not been adequately addressed, partly because existing solutions always require modifications to the target software or only work for specific scenarios. As a consequence, rollback is either neglected, restricted or prohibited in existing systems. In this paper, we systematically characterize the security threats of rollback as abnormal states of non-deterministic variables and of resumed program points caused by rollback, a characterization that also applies to other rollback scenarios. Based on this, we propose RollSec (for Rollback Security), which provides general measures including state extraction, recording and compensation to maintain the correctness of these abnormal states and thereby eliminate rollback threats. RollSec can automatically extract these states as protection targets based on language-independent information about the software; the states are monitored at run-time and compensated to correct values on each rollback, without requiring extra modifications or support from specific architectures. Finally, we implement a prototype of RollSec to verify its effectiveness, and conduct performance evaluations which demonstrate that only acceptable overhead is introduced.
Yuliang Shi, Jing Hu and Jianlin Zhang
Two-Stage Job Scheduling Model Based on Revenues and Resources ( Paper   |  PPT)
In a big data platform, multiple users share the platform's resources. For platform providers, an urgent problem is how to schedule multi-user jobs efficiently so as to take full advantage of the platform's resources, obtain the maximum revenue and meet the users' SLA requirements. We further study job scheduling for the MapReduce framework. This paper proposes a two-stage job scheduling model based on revenues and resources. In the model, we design a scheduling algorithm for maximum revenue (SMR) based on the latest start times of the jobs. The SMR algorithm ensures that jobs with larger revenues are completed before their deadlines, so that providers gain the largest total revenue. Under the premise of ensuring maximum revenue, a sequence adjustment scheduling algorithm based on the maximum resource utilization of the platform (SAS) is developed to improve the platform's resource utilization. Experimental results show that the two-stage job scheduling model proposed in this paper not only realizes the maximum revenue for the provider, but also improves the resource utilization and the comprehensive performance of the platform. Moreover, the model has great practicability and reliability.
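A simplified sketch of the latest-start-time idea behind revenue-first scheduling (a single-machine toy, not the paper's SMR algorithm for MapReduce; the job fields and greedy acceptance rule are hypothetical): a job's latest start time is its deadline minus its runtime, and higher-revenue jobs are admitted first as long as every admitted job can still start by its latest start time.

```python
def schedule_by_revenue(jobs):
    """Greedy sketch: consider jobs in descending revenue; accept a job if
    running all accepted jobs in deadline order lets each one start no later
    than its latest start time (deadline - runtime). Single machine, no
    preemption."""
    accepted, total_revenue = [], 0
    for job in sorted(jobs, key=lambda j: -j["revenue"]):
        trial = sorted(accepted + [job], key=lambda j: j["deadline"])
        t, feasible = 0, True
        for j in trial:
            if t > j["deadline"] - j["runtime"]:  # would miss its latest start time
                feasible = False
                break
            t += j["runtime"]
        if feasible:
            accepted, total_revenue = trial, total_revenue + job["revenue"]
    return accepted, total_revenue

jobs = [
    {"name": "A", "runtime": 4, "deadline": 4, "revenue": 10},
    {"name": "B", "runtime": 4, "deadline": 8, "revenue": 30},
    {"name": "C", "runtime": 4, "deadline": 8, "revenue": 20},
]
order, revenue = schedule_by_revenue(jobs)
print([j["name"] for j in order], revenue)  # ['B', 'C'] 50
```

Here job A (revenue 10) is dropped because admitting it would force B or C past a deadline; dropping B or C instead would yield only 40 of revenue rather than 50.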
Xiaoli Liu, Fan Yang, Yanan Jin, Zhan Wang, Zheng Cao and Ninghui Sun
Regional Congestion Mitigation in Lossless Datacenter Networks ( Paper   |  PPT)
To stop harmful congestion spreading, lossless networks need much faster congestion detection and reaction than the end-to-end approach provides. In this paper, we propose a switch-level regional congestion mitigation mechanism (RCM) that performs traffic management right at the edge of the congested region. RCM moves end-to-end congestion control to the hop-by-hop switch level to lower the congested region's load as fast as possible. Meanwhile, to handle longer-lasting congestion, RCM detours non-congested flows onto a lightly loaded available path, chosen based on the regional congestion degree, so that they avoid the congested region. Evaluation shows that the proposed RCM mechanism performs timely congestion control over microburst flows, and achieves more than 10% improvement in mice flows' FCT and throughput over DCQCN, with rarely any performance reduction on elephant flows.
Xi Yang and Jian Cao
A fast and accurate way for API network construction based on semantic similarity and community detection ( Paper   |  PPT)
With the rapid growth of the number and diversity of web APIs on the Internet, it has become more and more difficult for developers to find their desired APIs and build their own mashups. Therefore, web service recommendation and automatic mashup construction have become demanding techniques. Most research focuses on the service recommendation part but neglects the construction of the mashups. In this paper, we propose a new technique to quickly and accurately build an API network based on each API's input and output information. Once the API network is built, each pair of connected APIs can be seen as generating a promising mashup, so developers are freed from the exhausting search phase. Experiments using information on over 500 real APIs gathered from the Internet show that the proposed approach is effective and performs well.
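The output-to-input matching idea can be sketched as follows (a toy using token-overlap Jaccard similarity as a crude stand-in for the paper's semantic similarity; the API descriptions and threshold are hypothetical): connect API u to API v when u's output description is similar enough to v's input description, and read each edge as a candidate mashup.

```python
def jaccard(a, b):
    """Token-overlap similarity between two descriptions: a crude stand-in
    for the semantic similarity the paper computes."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def build_api_network(apis, threshold=0.5):
    """Add a directed edge u -> v when u's output description matches v's
    input description; each edge suggests a composable API pair."""
    edges = []
    for u in apis:
        for v in apis:
            if u is not v and jaccard(u["output"], v["input"]) >= threshold:
                edges.append((u["name"], v["name"]))
    return edges

apis = [
    {"name": "geocode",   "input": "street address",     "output": "latitude longitude"},
    {"name": "weather",   "input": "latitude longitude", "output": "temperature forecast"},
    {"name": "translate", "input": "text language",      "output": "translated text"},
]
print(build_api_network(apis))  # [('geocode', 'weather')]
```

The single edge found here corresponds to the natural mashup "geocode an address, then fetch its weather"; a real system would substitute a learned semantic model for the token overlap.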
Yuxiang Zhang, Lin Cui, Fung Po Tso, Quanlong Guan and Weijia Jia
TCon: A Transparent Congestion Control Deployment Platform for Optimizing WAN Transfers ( Paper   |  PPT)
Nowadays, many web services (e.g., cloud storage) are deployed inside datacenters and may trigger transfers to clients through the WAN. TCP congestion control is a vital component for improving the performance (e.g., latency) of these services. Given the complex networking environment, the default congestion control algorithms on servers may not always be the most efficient, and new advanced algorithms will continue to be proposed. However, changing the congestion control algorithm usually requires modification of the servers' TCP stacks, which is difficult if not impossible, especially considering the different operating systems and configurations on servers. In this paper, we propose TCon, a light-weight, flexible and scalable platform that allows administrators (or operators) to deploy any appropriate congestion control algorithm transparently, without making any changes to the servers' TCP stacks. We have implemented TCon in Open vSwitch (OVS) and conducted extensive test-bed experiments by transparently deploying the BBR congestion control algorithm over TCon. Test-bed results show that BBR over TCon works effectively and its performance stays close to a native implementation on servers, reducing latency by 12.76% on average.