Data-Intensive Distributed Computing

CS631 Final Project

The final project is a requirement only for graduate students taking CS 631.

The topic of the final project can be on anything you wish in the space of big data. Anything reasonably related to topics that are covered in the course is within scope. For reference, there are four types of projects you might consider:

Learn additional capabilities (e.g., visualization) of Python and Jupyter, and use them to build an interactive notebook for visualizing or exploring a dataset of your choosing. Your interactive notebook should interact with Spark, so that it will be capable of supporting exploration of data sets that are too large to fit in the memory of a single machine.
Implement a big data algorithm in Spark: choose a particular big data algorithm (for processing text, graphs, relational data, etc.) and implement it. Ideally, the implementation does not already exist in a library or open-source package. We want you to implement the algorithm from scratch, so resist the temptation to simply copy existing code; see the notes on academic integrity.
Learn and explore a (new) big data processing framework: although we discussed a variety of processing frameworks in class, the assignments focused on Spark. Here's your chance to learn a new processing framework, e.g., Spark Streaming, GraphX, Giraph, Flink, etc. The project would involve learning to use the processing framework and doing something interesting with it. The "something interesting" might be a data mining algorithm, although the expectations would be lower than building something in Spark, since learning the new framework would form an essential component of the project.
Perform some interesting data science. Is there a particular dataset you'd like to explore or analyze? Your project could involve performing interesting analytics on a dataset—here, the focus would be the analytical product and the insights gleaned, as opposed to the raw algorithms themselves. However, a superficial analysis with existing machine-learning libraries is not enough.

You may work in groups of up to three, or you can also work by yourself if you wish. The amount of effort devoted to the project should be proportional to the number of people in the team. As a guideline, the level of effort should be comparable to two assignments per person.

When you are ready, send me an email describing what you'd like to work on. I will provide you with feedback on appropriateness and scope of your proposed project. The "soft" deadline for this proposal is Mar 15, 2024. There is no penalty if you miss this deadline, but it is in your best interest to not leave this proposal to the last minute.

The deliverable for the final project is a report. Use the ACM Templates, or something similar. The contents of the report will vary depending on the type of project you are doing. However, it should certainly describe the goal of your project (what is your learning objective, or what problem are you trying to solve), your methodology, and some kind of evaluation of your results or progress. Your project proposal should explicitly describe how your project report (see below) will be organized: indicate what sections the report will have, and what you expect to present in each section. There are no hard limits on the length of your final report, but you should target something in the range of 5-10 pages.

The deadline for submission of your project report is 11pm on the last day of classes. As you are grad students, this deadline can be extended (somewhat) if you have a good reason.

Evaluation

Your final project will be evaluated according to the following criteria, weighted as shown.

Scope/Relevance (20%): Is the objective clear? Is the project course-related and substantial enough?
Methodology (30%): Is the methodology appropriate and clearly described?
Evaluation (30%): Did you evaluate your work? Did you achieve your objective? If not, did you explain why not?
Presentation (20%): Is your report well organized and clearly written?

Your report should clearly indicate where you obtained any data that you used in your project. Include a link to the data if possible.

A note about "Evaluation": Primarily, this is you evaluating whether or not you have achieved your stated objectives. If you are implementing an algorithm, the most obvious approach is to verify that your implementation is correct, and to compare its speed against non-distributed implementations (keep in mind that a Spark implementation is expected to be slower due to the framework overhead, with the advantage that it runs in parallel and so scales better). The other categories of project will have a more subjective self-evaluation, as the objectives are themselves more subjective.
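As a rough illustration of the "verify correctness, then compare speed" approach, here is a minimal sketch of an evaluation harness. Everything in it is hypothetical: the names `reference_word_count`, `candidate_word_count`, and `evaluate` are made up for this example, word count is only a placeholder algorithm, and the "candidate" is plain Python standing in for whatever Spark job you would actually run and collect results from.

```python
import time
from collections import Counter
from typing import Callable, Dict, List, Tuple

WordCounts = Dict[str, int]
CountFn = Callable[[List[str]], WordCounts]


def reference_word_count(lines: List[str]) -> WordCounts:
    """Single-machine baseline: count whitespace-separated tokens."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def candidate_word_count(lines: List[str]) -> WordCounts:
    """Stand-in for the implementation under test (hypothetical).

    In a real project this function would launch your Spark job and
    collect its output; it is plain Python here so the sketch runs anywhere.
    """
    counts: WordCounts = {}
    for line in lines:
        for token in line.split():
            counts[token] = counts.get(token, 0) + 1
    return counts


def timed(fn: CountFn, data: List[str]) -> Tuple[WordCounts, float]:
    """Run fn on data and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(data)
    return result, time.perf_counter() - start


def evaluate(reference: CountFn, candidate: CountFn, data: List[str]) -> dict:
    """Check the candidate against the reference and time both."""
    expected, reference_seconds = timed(reference, data)
    actual, candidate_seconds = timed(candidate, data)
    return {
        "correct": expected == actual,
        "reference_seconds": reference_seconds,
        "candidate_seconds": candidate_seconds,
    }


# Tiny made-up input, repeated so the timings are not trivially zero.
sample = ["a b a", "b c"] * 1000
report = evaluate(reference_word_count, candidate_word_count, sample)
print(report)
```

In a report, you would run such a harness at several input sizes and present the timings together, since a single small input mostly measures framework overhead.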

The use of Apache Spark should be justified in your project. For example, if you analyze only 1 MB of data, plain Python is probably the better tool. It is okay to analyze a smaller dataset if (1) the dataset can potentially be considered big data (for example, using 20 MB of Twitter data makes sense because the full data could be much bigger), (2) your Spark solution is scalable: even if you test it on smaller datasets, it could handle much bigger ones. If you do not follow this rule, you cannot get more than 50% of the project mark.

Assignments Data-Intensive Distributed Computing (Winter 2024)
