ENGR-E 516 ENGINEERING CLOUD COMPUTING
August 22 - December 16, 2022
Semester: Fall 2022
Lecture Time: Monday 4:10 PM - 6:40 PM
Lecture Location: Lindley Hall 025
Office Hour: Monday 9 AM - 10 AM
Office Location: Luddy Hall 4124
Instructor: Dr. Dingwen Tao (email@example.com)
Students should be comfortable programming in Python and Java. Familiarity with parallel and distributed computing is highly recommended.
Cloud Computing is “a large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualized, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand to external customers over the Internet.” It has become a driving force for information technology over the past several years, and it hints at a future in which we will compute not on local computers but on centralized facilities operated by third-party compute and storage utilities. Governments, research institutes, and industry leaders are rushing to adopt Cloud Computing to solve their ever-increasing computing and storage problems arising in the Internet Age. Three main factors have contributed to the surge of interest in Cloud Computing: 1) the rapid decrease in hardware cost, the increase in computing power and storage capacity, and the advent of multi-core architectures and modern supercomputers consisting of hundreds of thousands of cores; 2) the exponentially growing data sizes in scientific simulation and in Internet publishing and archiving; and 3) the widespread adoption of Services Computing and Web 2.0 applications.
This course is a tour through various topics and technologies related to Cloud Computing, for those who want to learn how the cloud works and how to use it easily and effectively to run their applications at low cost. The course covers a wide spectrum of cloud-based applications, including distributed data processing systems (e.g., MapReduce, Hadoop, Spark), data storage and caching (e.g., key-value stores), distributed file systems (e.g., HDFS), high-performance computing, and serverless computing. It also covers design principles and optimization techniques for building large-scale cloud platforms, such as hardware virtualization and distributed resource management and scheduling. Students will be exposed to popular cloud platforms such as Amazon EC2 and Google Cloud Platform, as well as NSF-funded research cloud infrastructure such as Chameleon Cloud and Jetstream2.
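The MapReduce model covered in this course can be illustrated with the classic word-count example. The sketch below is a single-process Python approximation of the map, shuffle, and reduce phases that Hadoop or Spark would distribute across a cluster; it is illustrative only, not an actual Hadoop program.

```python
from collections import defaultdict

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in the document
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values observed for one key
    return (key, sum(values))

docs = ["the cloud scales", "the cloud stores data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 2 and counts["cloud"] == 2
```

In a real deployment, the map and reduce functions run on many machines in parallel and the shuffle is performed by the framework over the network; the programmer supplies only the two functions.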
The course involves lectures, invited speakers, discussions of research papers, and project presentations.
An essential component of this course is a research project. Students are expected to work on a cutting-edge research problem in cloud computing and implement their ideas. Students should form a team of at most three and prepare a one-page project proposal by late September. A midterm project report and a final report written in a conference or journal format are also required. Project ideas have to be discussed with the instructor. Each project will be presented at the end of the semester in the class or through a recorded video. Students will have an allocation on the Jetstream2 or other Cloud platforms.
Students will be graded on the quality of their products and on evidence of intellectual engagement and growth with respect to the learning objectives, mainly based on 3-4 programming assignments (40%) and one final project (60%), which comprises a project proposal (10%), midterm report (15%), written paper (25%), and presentation (10%).
Aug 22: NO CLASS (reading reference papers)
Aug 29: Introduction to cloud computing
Sep 05: Overview of parallel and distributed computing
Sep 12: Cloud architecture and infrastructure
Sep 19: The MapReduce programming model 
Sep 26: Distributed data processing frameworks - Hadoop and Spark 
Sep 30: Project proposal due
Oct 03: Resource virtualization, VMs, and containers
Oct 10: Distributed file systems (e.g., GFS, HDFS) - NO in-person meeting, ONLY Zoom
Oct 17: Distributed resource management and scheduling
Oct 24: Data center management (e.g., outages and fault tolerance)
Oct 28: Midterm project report due
Oct 31: Guest lecture by Dr. Ying Mao
Nov 07: Machine Learning in the Cloud (e.g., MLlib)
Nov 14: NO CLASS - travel to SC conference
Nov 21: Graph processing in the Cloud (e.g., GraphX)
Nov 28: HPC in the Cloud (Cloud HPC) [9, 10] and serverless computing (e.g., FaaS)
Dec 05: Project presentation 1
Dec 12: Project presentation 2
Dec 14: Project written paper due
[1] Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51, no. 1 (2008): 107-113.
[2] Zaharia et al. "Spark: Cluster computing with working sets." In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10). 2010.
[3] Keith Adams and Ole Agesen. "A comparison of software and hardware techniques for x86 virtualization." In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 2-13. 2006.
[4] Ghemawat et al. "The Google file system." In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), pp. 29-43. 2003.
[5] Shvachko et al. "The Hadoop distributed file system." In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10. IEEE, 2010.
[6] Thomas Welsh and Elhadj Benkhelifa. "On resilience in cloud computing: A survey of techniques across the cloud domain." ACM Computing Surveys (CSUR) 53, no. 3 (2020): 1-36.
[7] Meng et al. "MLlib: Machine learning in Apache Spark." The Journal of Machine Learning Research 17, no. 1 (2016): 1235-1241.
[8] Gonzalez et al. "GraphX: Graph processing in a distributed dataflow framework." In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 599-613. 2014.
[9] He et al. "Case study for running HPC applications in public clouds." In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC), pp. 395-401. 2010.
[10] Chard et al. "FuncX: A federated function serving fabric for science." In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC), pp. 65-76. 2020.
[11] IBM. "What is Serverless?" https://www.ibm.com/cloud/learn/serverless, online.
1. Investigate how to leverage compression to accelerate function-as-a-service (FaaS) for scientific workloads. As some of you may know, Prof. Prateek Sharma and his students at IU ISE have been working on serverless computing and have published several papers at top conferences such as HPDC’22 and ASPLOS’22 (please refer to Prof. Sharma’s website: https://cgi.luddy.indiana.edu/~prateeks/). Prateek and I discussed the idea of using compression to accelerate data communication in FaaS. An evaluation of different scientific workloads across different lossy compressors and cloud infrastructures, together with an analysis of the performance impacts, could make an interesting paper.
Malla, S. and Christensen, K., 2020. HPC in the cloud: Performance comparison of function as a service (FaaS) vs infrastructure as a service (IaaS). Internet Technology Letters, 3(1), p.e137.
Chard, R., Babuji, Y., Li, Z., Skluzacek, T., Woodard, A., Blaiszik, B., Foster, I. and Chard, K., 2020, June. FuncX: A federated function serving fabric for science. In Proceedings of the 29th International symposium on high-performance parallel and distributed computing (pp. 65-76).
Copik, M., Kwasniewski, G., Besta, M., Podstawski, M. and Hoefler, T., 2021, November. SeBS: A serverless benchmark suite for function-as-a-service computing. In Proceedings of the 22nd International Middleware Conference (pp. 64-78).
Kim, J. and Lee, K., 2019, July. FunctionBench: A suite of workloads for serverless cloud function service. In 2019 IEEE 12th International Conference on Cloud Computing (CLOUD) (pp. 502-504). IEEE.
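To make project idea 1 concrete, the sketch below measures how much a compressible payload shrinks, and how long compression takes, before the data would be shipped between functions. It uses Python's built-in lossless zlib as a stand-in for the lossy scientific compressors the project would actually study, and the payload is a made-up stand-in for real simulation data; treat it as a starting point for a measurement harness, not as a result.

```python
import json
import time
import zlib

def measure(payload: bytes, level: int):
    # Time one compression run and record the resulting size
    t0 = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - t0
    return len(compressed), elapsed

# Hypothetical scientific payload: a large, fairly repetitive JSON record
payload = json.dumps({"field": [float(i % 100) for i in range(50_000)]}).encode()

for level in (1, 6, 9):
    size, secs = measure(payload, level)
    ratio = len(payload) / size
    print(f"level={level} ratio={ratio:.1f}x time={secs * 1000:.1f} ms")
```

The interesting trade-off for FaaS is whether the compression time plus the transfer time of the smaller payload beats transferring the raw data; the project would repeat this measurement with real lossy compressors, real scientific datasets, and real cloud network paths.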
2. Performance evaluation and analysis of running pre-exascale scientific applications with container technologies on HPC and cloud infrastructure. Many top-tier conferences now require the use of container technologies such as Docker and Singularity for artifact evaluation. DOE is also considering using cloud infrastructure in the future. Therefore, evaluating the performance of emerging exascale HPC applications (https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/) with containerization on both HPC and cloud environments is important for understanding the viability of this trend. Moreover, some newer container and orchestration technologies, such as Kubernetes and Podman, may also be usable for HPC. A comprehensive evaluation and discussion could make a good paper.
Saha, P., Beltre, A., Uminski, P. and Govindaraju, M., 2018. Evaluation of docker containers for scientific workloads in the cloud. In Proceedings of the Practice and Experience on Advanced Research Computing (pp. 1-8).
Liu, P. and Guitart, J., 2022. Performance characterization of containerization for HPC workloads on InfiniBand clusters: an empirical study. Cluster Computing, 25(2), pp.847-868.
Hu, G., Zhang, Y. and Chen, W., 2019, August. Exploring the performance of singularity for high performance computing scenarios. In 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (pp. 2587-2593). IEEE.
Zhou, N., Georgiou, Y., Pospieszny, M., Zhong, L., Zhou, H., Niethammer, C., Pejak, B., Marko, O. and Hoppe, D., 2021. Container orchestration on HPC systems through Kubernetes. Journal of Cloud Computing, 10(1), pp.1-14.
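A minimal harness for project idea 2 only needs repeatable wall-clock timing of the same workload launched natively and inside a container. The sketch below assumes a hypothetical `./miniapp` binary and `miniapp:latest` Docker image; substitute a real ECP proxy app and its container image.

```python
import statistics
import subprocess
import time

def time_command(cmd, repeats=3):
    # Run a command several times and report the median wall-clock time in seconds
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# Hypothetical pairing: the same proxy app run natively and under Docker.
# Replace "./miniapp" and the image name with a real ECP proxy app.
native_cmd = ["./miniapp", "--size", "small"]
docker_cmd = ["docker", "run", "--rm", "miniapp:latest", "--size", "small"]
```

Comparing medians over several repeats helps factor out container cold starts and filesystem caching; on real HPC systems, MPI launch mechanics and InfiniBand pass-through (see the Liu and Guitart paper above) add further dimensions to the study.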
3. Investigate the performance of ML inference serving on serverless platforms. Existing serverless platforms (such as AWS Lambda) work well for image-based ML inference, where requests are homogeneous in their service demands. However, recent advances in natural language processing (NLP) cannot fully benefit from existing serverless platforms because their requests are intrinsically heterogeneous. To this end, the first paper below proposes a framework that optimizes the batching of heterogeneous ML inference serving requests to minimize their monetary cost while meeting their service-level objectives. However, the paper only evaluates AWS Lambda (a serverless computing service) and DeepSpeech (an NLP application). It would be very interesting to see its performance on other ML applications of your choice, such as recommendation systems.
Ali, A., Pinciroli, R., Yan, F. and Smirni, E., 2022. Optimizing inference serving on serverless platforms. Proceedings of the VLDB Endowment, 15. https://www.vldb.org/pvldb/vol15/p2071-ali.pdf.
Ali, A., Pinciroli, R., Yan, F. and Smirni, E., 2020, November. Batch: Machine learning inference serving on serverless platforms with adaptive batching. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-15). IEEE.
Deploying machine learning models with serverless templates. https://aws.amazon.com/blogs/compute/deploying-machine-learning-models-with-serverless-templates/.
Machine learning inference at scale using AWS serverless. https://aws.amazon.com/blogs/machine-learning/machine-learning-inference-at-scale-using-aws-serverless/.
Accelerate and improve recommender system training and predictions using Amazon SageMaker Feature Store. https://aws.amazon.com/blogs/machine-learning/accelerate-and-improve-recommender-system-training-and-predictions-using-amazon-sagemaker-feature-store/.
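The core of the batching idea behind project idea 3 can be sketched in a few lines: given queued requests with individual latency SLOs and a (profiled) linear model of how invocation latency grows with batch size, pick the largest batch that still meets every request's deadline, since larger batches amortize the per-invocation cost. The latency constants below are hypothetical placeholders, not numbers from the papers above.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival: float  # arrival time in seconds
    slo: float      # latency budget in seconds

def choose_batch_size(queue, now, base_ms=50.0, per_item_ms=10.0, max_batch=8):
    """Pick the largest batch size that still meets every batched request's SLO.

    base_ms/per_item_ms form a hypothetical linear latency model for one
    serverless invocation; real values would come from profiling the platform.
    """
    best = 1
    for b in range(1, min(len(queue), max_batch) + 1):
        batch_latency = (base_ms + per_item_ms * b) / 1000.0
        # The oldest request in the batch is the tightest constraint
        if all(now + batch_latency - r.arrival <= r.slo for r in queue[:b]):
            best = b
        else:
            break
    return best

queue = [Request(arrival=0.0, slo=0.5) for _ in range(10)]
print(choose_batch_size(queue, now=0.1))
```

Heterogeneous NLP requests break the single linear model assumed here, which is exactly the gap the batching framework addresses; a project could swap in per-model latency profiles and replay traces from a recommendation workload.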