CS 431/631 Assignments

CS631 Final Project

The final project is a requirement only for graduate students taking CS 631.

The topic of the final project can be on anything you wish in the space of big data. Anything reasonably related to topics that are covered in the course is within scope. For reference, there are four types of projects you might consider:

You may work in groups of up to three, or by yourself if you wish. The amount of effort devoted to the project should be proportional to the number of people on the team. As a guideline, the level of effort should be comparable to two assignments per person.

When you are ready, send me an email describing what you'd like to work on. I will provide you with feedback on appropriateness and scope of your proposed project. The "soft" deadline for this proposal is Mar 15, 2024. There is no penalty if you miss this deadline, but it is in your best interest to not leave this proposal to the last minute.

The deliverable for the final project is a report. Use the ACM Templates, or something similar. The contents of the report will vary depending on the type of project you are doing. However, it should certainly describe the goal of your project (what is your learning objective, or what problem are you trying to solve), your methodology, and some kind of evaluation of your results or progress. Your project proposal should explicitly describe how your project report will be organized: indicate what sections the report will have, and what you expect to present in each section. There are no hard limits on the length of your final report, but you should target something in the range of 5-10 pages.

The deadline for submission of your project report is 11pm on the last day of classes. As you are grad students, this deadline can be extended (somewhat) if you have a good reason.

Evaluation

Your final project will be evaluated according to the following criteria, with roughly equal weight placed on each one.

Your report should clearly indicate where you obtained any data that you used in your project. Include a link to the data if possible.

A note about "Evaluation": Primarily, this is you evaluating whether or not you have achieved your stated objectives. If you are implementing an algorithm, the most obvious approach is to verify that your implementation is correct and to compare its speed against non-distributed implementations. Keep in mind that a Spark implementation is expected to be slower on small inputs due to framework overhead, with the advantage that it runs in parallel and so scales better. The other categories of project will have a more subjective self-evaluation, as their objectives are themselves more subjective.
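
As a minimal sketch of that self-evaluation for an algorithm project, the Python harness below checks a candidate implementation against a single-machine baseline and records wall-clock time for both. All names here (`timed`, `compare_implementations`, the two word-count functions) are hypothetical and not part of the course materials. In a real project the candidate would be your Spark job rather than a second plain-Python function.

```python
import time
from collections import Counter


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def compare_implementations(baseline, candidate, data):
    """Check that candidate matches baseline on data, and time both."""
    expected, baseline_s = timed(baseline, data)
    actual, candidate_s = timed(candidate, data)
    return {
        "correct": actual == expected,
        "baseline_seconds": baseline_s,
        "candidate_seconds": candidate_s,
    }


# Hypothetical stand-ins: a single-machine word count as the baseline, and a
# second implementation playing the role of the distributed job under test.
def word_count_baseline(lines):
    return dict(Counter(word for line in lines for word in line.split()))


def word_count_candidate(lines):
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


report = compare_implementations(
    word_count_baseline, word_count_candidate, ["a b a", "b c"]
)
print(report["correct"], report["baseline_seconds"], report["candidate_seconds"])
```

A single timing on a tiny input says little by itself. Repeating the run and reporting the input size alongside the timings makes the comparison easier to interpret in the report.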

The use of Apache Spark should be justified in your project. For example, if you analyze only 1 MB of data, isn't it better to use plain Python? Remember that it is okay to analyze a smaller dataset provided that: (1) the dataset can potentially be considered big data (for example, using 20 MB of Twitter data makes sense because the full dataset can be much bigger); (2) your Spark solution is scalable, so that even if you test it on smaller datasets, it can potentially handle much bigger ones. If you do not follow this rule, you cannot get more than 50% of the project mark.
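
One possible way to back up a scalability claim when only a small sample is available is to time the same job on inputs of growing size and report the cost per record. The sketch below does this in plain Python. The names `scaling_profile`, `make_sample`, and `count_distinct_users` are hypothetical placeholders for your own sampling step and job, and the sizes are arbitrary.

```python
import time


def scaling_profile(job, make_input, sizes):
    """Time job on inputs of growing size; return one row per size.

    Every entry in sizes must be a positive record count.
    """
    rows = []
    for n in sizes:
        data = make_input(n)  # build the input outside the timed region
        start = time.perf_counter()
        job(data)
        elapsed = time.perf_counter() - start
        rows.append(
            {"records": n, "seconds": elapsed, "seconds_per_record": elapsed / n}
        )
    return rows


# Hypothetical stand-ins for a real sampling step and a real job.
def make_sample(n):
    return [f"user{i % 7} said hello" for i in range(n)]


def count_distinct_users(lines):
    return len({line.split()[0] for line in lines})


profile = scaling_profile(count_distinct_users, make_sample, [1_000, 10_000, 100_000])
for row in profile:
    print(row["records"], f"{row['seconds_per_record']:.2e}")
```

A roughly flat cost per record as the input grows is some evidence that the approach will carry over to larger data, although it does not replace an argument about how the job is partitioned across workers.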
