Search is not easy. Humans answer layered questions smoothly (ask your local football expert "Which team was leading 14-10 at halftime in the game played on January 15, 1967, in the state that includes San Francisco?" and he will respond "The Green Bay Packers") but try tossing that query at a browser. It's not easy. MIT CSAIL's Infolab is responsible for START, the world's first Web-based question answering system, started (mind the pun) in December 1993. Naturally, with advances in NLP and the study of linguistics, especially language parsing, we now have more accurate, robust models for answering such questions. It was my honor to join the Infolab for a summer under Ms. Sue Felshin, working on a specific aspect of START: improving answers through Wikipedia querying.


I was presented with a large, existing GitHub project, developed over probably more than 200,000 man-hours. The repository of work in language parsing, word tagging, and data scraping from online sources was already immense. Imagine being slapped by greatness: this is the feeling! My work, within a small group, would revolve around fixing issues raised on GitHub by other developers, creating and suggesting new methods for parsing Wikipedia content, and (for undergraduates) learning about the various technologies and implementations that the 'masters at work', the graduate students, had built (after all, a UROP isn't complete without standing on the shoulders of giants every so often)!


Out of all the UROPs, I'd say I learned the most from this one. Not only was I exposed to then-unfamiliar tools, especially Emacs1 (in the interview Ms. Felshin asked me "Have you heard of Emacs?", to which I, puzzled, responded "No"), I also learned how to maneuver my way around the command line, invoking programs like Telnet and Lisp (for interpreting certain language patterns). GitHub as a home not for small-scale code development but for mammoth projects was a new mindset for me. Coming in with far more knowledge of management in the sense of team leadership than of code storage, I put together the pieces by asking the lab about the high-level concepts surrounding GitHub.


The opportunity to learn 'computer science', in its many facets, was laid bare in WikipediaBase. I searched through Python documentation to program with re (regular expressions), visited Emacs pages to use the editor more fluently (and do greater justice to my typing speed), and looked through the rest of the START repository to understand how the different pieces of code worked together.


There were hundreds of bugs and issues, recorded as failures in the unit tests. I developed a Python script that automatically grouped failures by common keywords in their error messages and suggested which GitHub issues from the pull requests to address. This way, I did not have to manually chase down each of the ~250 unit-test failures I found; I reduced them to 27 separate cases to debug.
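The original script isn't shown here, but the grouping idea can be sketched roughly as follows. The keyword signature used (exception name plus the first quoted identifier) is my own stand-in heuristic, not necessarily the criteria the real script used:

```python
import re
from collections import defaultdict

def group_errors(messages):
    """Bucket error messages by a crude keyword signature:
    the exception class name plus the first quoted identifier,
    if any. Hundreds of messages collapse into a handful of
    buckets, one debugging case per bucket."""
    groups = defaultdict(list)
    for msg in messages:
        exc = re.search(r"\b([A-Z]\w*Error)\b", msg)
        ident = re.search(r"'([^']+)'", msg)
        key = (exc.group(1) if exc else "Unknown",
               ident.group(1) if ident else "")
        groups[key].append(msg)
    return groups

# Hypothetical failure logs for illustration only.
logs = [
    "KeyError: 'infobox' in parse_article",
    "KeyError: 'infobox' in fetch_page",
    "AttributeError: 'NoneType' object has no attribute 'text'",
]
for key, msgs in group_errors(logs).items():
    print(key, len(msgs))
```

Three raw failures become two cases here; at the scale of ~250 failures, the same collapsing is what made 27 cases out of the pile.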


I also learned to patiently trace an issue to the end, no matter where it led. When an error message showed up and pointed me to a function, I assessed the function's behavior by isolating it and writing my own test cases, and then checked whether any other methods in the repository called it. That way, I could either fix the issue by myself, if it was within my 'sphere of influence', or post an issue to the group at large and specify where it came from. There was a beautiful parallelism between managing people and managing code: in both, one knows one's place in view of the whole system.
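The isolate-and-test step above can be sketched like this. The function and its inputs are hypothetical stand-ins (none of this is from the actual START codebase), but the workflow is the same: pull the suspect function out on its own, then throw ad hoc cases at it, including the edge case that likely triggered the failure:

```python
import re

def extract_year(date_string):
    """Hypothetical suspect function: pull a four-digit year
    out of a free-form date string, or None if absent."""
    m = re.search(r"\b(\d{4})\b", date_string)
    return int(m.group(1)) if m else None

# Ad hoc test cases exercising the function in isolation.
assert extract_year("January 15, 1967") == 1967   # normal case
assert extract_year("15 Jan 1967") == 1967        # alternate format
assert extract_year("date unknown") is None       # edge case: no year
```

Once the function's behavior is pinned down, a repository-wide search for its callers (e.g. `grep -rn "extract_year" .`) shows whether a fix is local or needs to be raised with the group.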



Footnotes

1. I still prefer Vim over Emacs! Strongly!