Introductory computer science classes frequently use programming assignments as a primary means of assessment and practice. Because these classes can be quite large, especially in the case of MOOCs (massive open online courses), they tend to have more programming assignments to grade than a course staff can reasonably handle. The typical solution is an autograder that takes a student's submission as input, runs a suite of test cases on it, and assigns a grade based on how many tests pass. This solution has real merits, but a key detriment is that because grades are assigned simply on whether certain inputs produce the correct outputs, the feedback can convey nothing beyond which tests passed and which failed. Students are thus left to work out on their own why their solutions are right or wrong, rather than being helped by the course's feedback.
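For concreteness, here is a minimal sketch of this style of grading, with hypothetical names (not any particular course's infrastructure): a submission is scored by the fraction of test cases whose expected output it reproduces exactly.

    (* Illustrative sketch of test-based autograding: run the
       submission on each input and score by exact output match. *)
    fun grade (submission : int -> int) (tests : (int * int) list) : real =
      let
        val passed = List.filter (fn (input, expected) =>
                                    submission input = expected) tests
      in
        real (length passed) / real (length tests)
      end

Nothing in such a score explains why a failing submission failed; it only records that it did.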
I would like to work with Professor Umut Acar and Professor Ruben Martins to help resolve this issue. Although it's infeasible for a course staff to hand-grade every student assignment individually, if we can build an algorithm that sorts submissions into buckets of similar submissions (e.g. all students who implemented algorithm A but made mistake B), then the course staff could review submissions bucket by bucket and still provide personal, targeted feedback. Bucketing student submissions reduces to the problem of determining whether two individual programs are equivalent, so the research I'd like to work on, more specifically, is developing an algorithm that determines whether two arbitrary SML programs are equivalent.
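As a hypothetical instance of the problem, consider two submissions that compute the sum of a list by different means:

    (* One hypothetical student uses direct recursion... *)
    fun sum [] = 0
      | sum (x :: xs) = x + sum xs

    (* ...while another uses a fold from the basis library. *)
    fun sum' xs = List.foldl (op +) 0 xs

The two functions agree on every input, so an ideal checker would place their authors in the same bucket even though the programs share almost no syntax.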
The basic outline of our approach to determining whether two SML programs are equivalent is as follows. LambdaPix is a language designed by Vijay Ramamurthy (a graduated CS undergraduate from whom I am inheriting this project) that is simpler than SML but equally expressive, and is therefore easier to manipulate and analyze. The first step of the algorithm, given two SML programs as input, is to transpile each program individually into LambdaPix. The algorithm then deconstructs the two programs bit by bit in parallel, traversing their abstract syntax trees and attempting to map components of the first program to components of the second. As this proceeds, it builds up a logical formula that records the connections between the two programs and states that they are equivalent. This formula contains many free variables, and it is valid (true for all assignments of the free variables) only if the two programs are equivalent. Once the programs have been fully deconstructed, we feed the resulting formula into a theorem prover: if the prover establishes that the formula is valid, the programs are equivalent.
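The sketch below gives a rough sense of the shape of that traversal, using a toy expression language and toy formula type in place of LambdaPix. All datatypes and names here are mine, for illustration only; the project's actual language and inference rules are far richer.

    (* Toy stand-ins for LambdaPix expressions and logical formulas. *)
    datatype exp = Var of string
                 | Num of int
                 | Plus of exp * exp

    datatype formula = EqVar of string * string  (* corresponding variables equal *)
                     | Conj of formula list      (* all sub-formulas hold; Conj [] is truth *)
                     | Falsum                    (* no alignment found *)

    (* Walk the two ASTs in parallel; structurally matching nodes
       contribute constraints relating their subterms.  Returning
       Falsum when no rule applies keeps the method sound but
       incomplete: some equivalent pairs will go unrecognized.  The
       real algorithm also carries an environment matching bound
       variables across the two programs. *)
    fun align (Num m, Num n) = if m = n then Conj [] else Falsum
      | align (Var x, Var y) = EqVar (x, y)
      | align (Plus (a, b), Plus (c, d)) = Conj [align (a, c), align (b, d)]
      | align _ = Falsum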
There are two primary, long-term goals for this project. First, we want to build an algorithm that is actually useful for grading in a real CS class; in particular, we have identified 15-150 as a good candidate for making use of our algorithm. Second, we would like to publish a conference paper sharing this approach. Some efforts advance both of these goals; others advance only one or the other.
The obvious effort that advances both agendas is to improve the efficacy of the current algorithm. The algorithm is designed to be sound but incomplete (incompleteness is inevitable, as the problem it tackles is undecidable), yet progress can be made towards making it less incomplete while remaining sound. In this context, that means making the algorithm more robust: capable of recognizing more dissimilar programs as equivalent, while still never incorrectly asserting that non-equivalent programs are equivalent. A variety of potential incremental improvements have been identified to further this end: incorporating typing into LambdaPix (the current version of LambdaPix is untyped); adding inference rules to the main algorithm, allowing it to deconstruct the LambdaPix programs and build up the logical formula more intelligently; and creating more robust notions of equivalence, so that when two programs use helper functions to handle sub-problems similarly, but not in exactly the same way, the relationship between the helper functions and sub-problems can still be identified. For example, if one program has a helper function that says whether the length of a list is odd, and the other has a helper function that says whether the length is even, then there is a clear connection between the two even though they technically aren't equivalent.
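In code, that last example might look like this (hypothetical student helpers):

    fun oddLength xs = length xs mod 2 = 1
    fun evenLength xs = length xs mod 2 = 0

    (* Not equivalent, but complementary: for every list xs,
       oddLength xs = not (evenLength xs).  A richer notion of
       equivalence would let the algorithm discover this relationship
       and continue aligning the programs that use these helpers. *)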
An effort that would serve the first goal, but not necessarily the second, would be to make the current algorithm more versatile and easier to use. In particular, the current implementation is sound with respect to saying that programs are equivalent, but unsound with respect to saying that they aren't. This is by design; no algorithm that always returns an answer can be sound in both directions, as that would decide an undecidable problem. But discussions with 15-150 TA Harrison Grodin have brought to light that, in 15-150's context, it might be more useful for the algorithm to be sound with respect to saying that programs aren't equivalent, and unsound with respect to saying that they are. Modifying the algorithm in this way would plausibly advance the agenda of making it useful for a real CS course, but would likely be unhelpful towards the goal of writing a paper unless the paper were substantially reframed.
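A minimal sketch of that flipped guarantee, assuming a simple testing-based approach (illustrative only, not the project's existing formula-based machinery): any concrete input on which two programs disagree soundly proves them non-equivalent, while agreement on finitely many inputs proves nothing.

    (* Search a list of candidate inputs for a counterexample.
       SOME x soundly establishes "not equivalent"; NONE says only
       that these particular inputs failed to distinguish them. *)
    fun differOn (f : int -> int, g : int -> int) (inputs : int list) =
      List.find (fn x => f x <> g x) inputs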
An effort that would serve the second goal, but not the first, would be to prove the soundness of the current algorithm. The algorithm has been developed incrementally on top of Vijay Ramamurthy's previous work, and for the portions that haven't changed since Vijay graduated, soundness proofs already exist. Substantial portions have been modified since, however, and in order to present the algorithm in a paper, those portions would need to be formalized and their soundness proved.
At this point I can't identify exactly what 75%, 100%, or 125% goals would look like, because it is not yet clear which avenues of progress we wish to allocate resources towards. I am currently in contact with 15-150 TA Harrison Grodin to determine how well the algorithm presently does on student data, and the decisions we make about which avenues to pursue will largely depend on those results. Until those results return, I can't give more specific goals than the landscape of desirable outcomes outlined above.
As mentioned previously, I am working with Harrison to benchmark the algorithm's current performance on real student code. My main technical milestone for this semester is to complete that evaluation. Its results will largely determine the direction we take the project next semester, which will in turn determine next semester's milestones; until the results are in, I can't provide more detail about milestones for the spring.
Milestone Report 1

I presently have not read many papers that directly inform this research. At the recommendation of Professor Ruben Martins, I've read “Semantic Program Alignment for Equivalence Checking” by Churchill et al. [1], as well as “SemCluster: Clustering of Imperative Programming Assignments Based on Quantitative Semantic Features” by Perry et al. [2], but in truth, neither of these papers particularly informs our current approach. The direction of this project is set predominantly by the work done in Vijay Ramamurthy's senior thesis (of which this project is literally an extension), the expertise of Professors Umut Acar and Ruben Martins, and the needs of 15-150. Moving towards writing a paper would require me to acquaint myself more thoroughly with the existing literature, but for the purposes of simply advancing the algorithm, I am content to rely on Professors Acar's and Martins' knowledge of when existing findings or techniques might be beneficial to consider.
Running the algorithm requires access to HaMLet, an implementation of SML, and the SMT solver Z3, both of which are free and easily accessible. The algorithm itself has very low time and memory costs and can easily be run on a personal laptop. For benchmarking purposes, access to real student data is desirable, and a relationship with 15-150 has been formed for this purpose.
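For reference, the standard way to check validity with an SMT solver like Z3 is to assert the negation of the formula and ask for satisfiability: if the negation is unsatisfiable, the formula is valid. Below is a sketch of emitting such a query as SMT-LIB text; the wrapper function and the choice of Int sorts are illustrative, not the project's actual Z3 integration.

    (* Build an SMT-LIB query asking whether phi can be false: an
       "unsat" answer from Z3 means phi is valid, i.e. the programs
       were shown equivalent. *)
    fun validityQuery (freeVars : string list, phi : string) : string =
      String.concat
        (map (fn v => "(declare-const " ^ v ^ " Int)\n") freeVars
         @ ["(assert (not ", phi, "))\n", "(check-sat)\n"])

For instance, validityQuery (["x"], "(= (+ x 1) (+ 1 x))") produces a query for which Z3 answers unsat, confirming that the formula holds for all integer assignments to x.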
[1] Churchill, Padon, Sharma, and Aiken. Semantic program alignment for equivalence checking. In PLDI, 2019.
[2] Perry, Kim, Samanta, and Zhang. SemCluster: Clustering of imperative programming assignments based on quantitative semantic features. In PLDI, 2019.