9.2. Exploring a Legacy Codebase

If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

—Rob Pike

The goal of exploration is to understand the app from both the customers’ and the developers’ point of view. The specific techniques you use may depend on your immediate aims:

  • You’re brand new to the project and need to understand the app’s overall architecture, documenting as you go so others don’t have to repeat your discovery process.

  • You need to understand just the moving parts that would be affected by a specific change you’ve been asked to make.

  • You’re looking for areas that need beautification because you’re in the process of porting or otherwise updating a legacy codebase.

Just as we explored SaaS architecture in Chapter 3 using height as an analogy, we can follow some “outside-in” steps to understand the structure of a legacy app at various levels:

  1. Check out a scratch branch to run the app in a development environment

  2. Learn and replicate the user stories, working with other stakeholders if necessary

  3. Examine the database schema and the relationships among the most important classes

  4. Skim all the code to quantify code quality and test coverage

Since operating on the live app could endanger customer data or the user experience, the first step is to get the application running in a development or staging environment in which perturbing its operation causes no inconvenience to users. Create a scratch branch of the repo that you never intend to check back in and that you can therefore use freely for experimentation. Create a development database if one doesn’t already exist. An easy way to do this is to clone the production database if it isn’t too large, thereby sidestepping numerous pitfalls:

  • The app may have relationships such as has-many or belongs-to that are reflected in the table rows. Without knowing the details of these relationships, you might create an invalid subset of data. Using RottenPotatoes as an example, you might inadvertently end up with a review whose movie_id and moviegoer_id refer to nonexistent movies or moviegoers.

  • Cloning the database eliminates possible differences in behavior between production and development resulting from differences in database implementations, differences in how certain data types such as dates are represented in different databases, and so on.

  • Cloning gives you realistic valid data to work with in development.
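The first pitfall above can be made concrete with a small sketch. It uses plain Ruby hashes rather than a real database, and the rows, ids, and the helper name orphans are all hypothetical, chosen only to illustrate how copying part of each table can leave reviews pointing at rows that were never copied.

```ruby
# Hypothetical rows, as they might look after copying only part of each table.
MOVIES     = [{ id: 1, title: "Alien" }, { id: 2, title: "Up" }].freeze
MOVIEGOERS = [{ id: 10, name: "Mona" }].freeze
REVIEWS    = [
  { id: 100, movie_id: 1, moviegoer_id: 10 },
  { id: 101, movie_id: 3, moviegoer_id: 10 }, # movie 3 was not copied
  { id: 102, movie_id: 2, moviegoer_id: 11 }  # moviegoer 11 was not copied
].freeze

# Return the child rows whose foreign key matches no id in the parent table.
def orphans(children, foreign_key, parents)
  parent_ids = parents.map { |row| row[:id] }
  children.reject { |row| parent_ids.include?(row[foreign_key]) }
end

missing_movie     = orphans(REVIEWS, :movie_id, MOVIES).map { |r| r[:id] }
missing_moviegoer = orphans(REVIEWS, :moviegoer_id, MOVIEGOERS).map { |r| r[:id] }

puts "Reviews whose movie is missing: #{missing_movie.inspect}"
puts "Reviews whose moviegoer is missing: #{missing_moviegoer.inspect}"
```

A check along these lines, run against a hand-built subset, would flag the dangling references before they cause confusing errors in development; cloning the whole database avoids the problem entirely.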

If you can’t clone the production database, or you have successfully cloned it but it’s too unwieldy to use in development all the time, you can create a development database by extracting fixture data from the real database using the steps in Figure 9.3.

# on production computer:
RAILS_ENV=production rake db:schema:dump
RAILS_ENV=production rake db:fixtures:extract
# copy db/schema.rb and test/fixtures/*.yml to development computer
# then, on development computer:
rake db:create        # uses RAILS_ENV=development by default
rake db:schema:load
rake db:fixtures:load
Figure 9.3: You can create an empty development database that has the same schema as the production database and then populate it with fixtures. Although Chapter 8 cautions against the abuse of fixtures, in this case we are using them to replicate known behavior from the production environment in your development environment.
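To give a feel for what the extraction step produces, here is a minimal sketch of turning table rows into fixture-style YAML. It is not the implementation of the rake task in Figure 9.3; the rows, the labeling scheme, and the helper name to_fixture_yaml are assumptions made for illustration, and the only library used is Ruby’s standard yaml.

```ruby
require "yaml"

# Hypothetical rows as they might be read from a production "reviews" table.
REVIEW_ROWS = [
  { "id" => 1, "potatoes" => 5, "movie_id" => 1, "moviegoer_id" => 10 },
  { "id" => 2, "potatoes" => 3, "movie_id" => 2, "moviegoer_id" => 10 }
].freeze

# Fixture files map a unique label to each row's column values.
# Here the label is simply the table name plus the row's id (an assumption).
def to_fixture_yaml(table, rows)
  labeled = rows.each_with_object({}) do |row, fixtures|
    fixtures["#{table}_#{row['id']}"] = row
  end
  labeled.to_yaml
end

puts to_fixture_yaml("reviews", REVIEW_ROWS)
```

Because the ids and foreign keys are preserved verbatim, loading fixtures produced this way keeps the relationships among rows intact, provided every referenced row is extracted too.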

Once the app is running in development, have one or two experienced customers demonstrate how they use the app, indicating during the demo what changes they have in mind (Nierstrasz et al. 2009). Ask them to talk through the demo as they go; although their comments will often be in terms of the user experience (“Now I’m adding Mona as an admin user”), if the app was created using BDD, the comments may reflect examples of the original user stories and therefore the app’s architecture. Ask frequent questions during the demo, and if the maintainers of the app are available, have them observe the demo as well. In Section 9.3 we will see how these demos can form the basis of “ground truth” tests to underpin your changes.

Once you have an idea of how the app works, take a look at the database schema; Fred Brooks, Rob Pike, and others have all acknowledged the importance of understanding the data structures as a key to understanding the app logic. You can use an interactive database GUI to explore the schema, but you might find it more efficient to run rake db:schema:dump, which creates a file db/schema.rb containing the database schema in the migrations DSL introduced in Section 4.2. The goal is to match up the schema with the app’s overall architecture.
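One quick way to start matching the schema to the architecture is to list each table together with the columns that look like foreign keys. The sketch below scans schema text with regular expressions; the schema excerpt is a hypothetical one for RottenPotatoes, the exact layout of db/schema.rb varies across Rails versions, and the helper name foreign_keys_by_table is ours. It relies only on the Rails convention that foreign key columns end in _id.

```ruby
# Hypothetical excerpt of a db/schema.rb file for RottenPotatoes.
SCHEMA = <<~SCHEMA_TEXT
  create_table "movies", force: :cascade do |t|
    t.string "title"
    t.datetime "release_date"
  end

  create_table "reviews", force: :cascade do |t|
    t.integer "potatoes"
    t.integer "movie_id"
    t.integer "moviegoer_id"
  end
SCHEMA_TEXT

# Map each table name to the columns whose names end in _id.
def foreign_keys_by_table(schema_text)
  result = {}
  current_table = nil
  schema_text.each_line do |line|
    if (m = line.match(/create_table "(\w+)"/))
      current_table = m[1]
      result[current_table] = []
    elsif current_table && (m = line.match(/t\.\w+\s+"(\w+_id)"/))
      result[current_table] << m[1]
    end
  end
  result
end

foreign_keys_by_table(SCHEMA).each do |table, keys|
  puts "#{table}: #{keys.empty? ? '(no foreign keys)' : keys.join(', ')}"
end
```

Tables with several foreign key columns, like reviews here, are usually the ones that tie other models together, so they are a good place to begin reading.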

Figure 9.4 shows a simplified Unified Modeling Language (UML) class diagram generated by the railroady gem that captures the relationships among the most important classes and the most important attributes of those classes. The diagram may look overwhelming initially, but since not all classes play an equally important structural role, you can start by identifying the “highly connected” classes that are probably central to the application’s functions. For example, in Figure 9.4, the Customer and Voucher classes are connected to each other and to many other classes. You can then identify the tables corresponding to these classes in the database schema.

Figure 9.4: This simplified Unified Modeling Language (UML) class diagram, produced automatically by the railroady gem, shows the models in a Rails app that manages ticket sales, donations, and performance attendance for a small theater. Edges with arrowheads or circles show relationships between classes: a Customer has many Visits and Vouchers (open circle to arrowhead), has one most_recent_visit (solid circle to arrowhead), and has and belongs to many Labels (arrowhead to arrowhead). Plain edges show inheritance: Donation and Voucher are subclasses of Item. (All of the important classes here inherit from ActiveRecord::Base, but railroady draws only the app’s classes.) We will see other types of UML diagrams in Chapter 11.
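Finding “highly connected” classes amounts to counting how many edges touch each class. The sketch below does this for just the relationships named in the caption of Figure 9.4, so the edge list is deliberately partial and the resulting counts understate the full diagram; the helper name connection_counts is hypothetical.

```ruby
# Only the relationships spelled out in the caption of Figure 9.4;
# the full diagram contains more classes and edges than these.
EDGES = [
  %w[Customer Visit],
  %w[Customer Voucher],
  %w[Customer Label],
  %w[Donation Item],
  %w[Voucher Item]
].freeze

# Count the edges touching each class, most connected first
# (ties are broken alphabetically).
def connection_counts(edges)
  counts = Hash.new(0)
  edges.each do |a, b|
    counts[a] += 1
    counts[b] += 1
  end
  counts.sort_by { |name, count| [-count, name] }
end

connection_counts(EDGES).each { |name, count| puts "#{name}: #{count}" }
```

Even on this partial list Customer comes out on top, which matches the intuition that it is central to the app.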

Having familiarized yourself with the app’s architecture, most important data structures, and major classes, you are ready to look at the code. The goal of inspecting the code is to get a sense of its overall quality, test coverage, and other statistics that serve as a proxy for how painful it may be to understand and modify. Therefore, before diving into any specific file, run rake stats to get the total number of lines of code and lines of tests for each file; this information can tell you which classes are most complex and therefore probably most important (highest LOC), best tested (best code-to-test ratio), simple “helper” classes (low LOC), and so on, deepening the understanding you bootstrapped from the class diagram and database schema. (Later in this chapter we’ll show how to evaluate code with some additional quality metrics to give you a heads-up about where the hairiest efforts might be.) If test suites exist, run them; assuming most tests pass, read the tests to help understand the original developers’ intentions. Then spend one hour (Nierstrasz et al. 2009) inspecting the code in the most important classes as well as those you believe you’ll need to modify (the change points), which by now you should be getting a good sense of.
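To make the two statistics concrete, here is a minimal sketch of counting lines of code and computing a code-to-test ratio. It is not how rake stats is implemented: the counting rule (skip blank lines and lines that are only comments), the directory prefixes, the file contents, and the helper names loc and summarize are all assumptions for illustration.

```ruby
# Hypothetical file contents; a real script would read these from disk.
VOUCHER_CODE = <<~CODE
  class Voucher
    # Marks the voucher as used.
    def redeem
      true
    end
  end
CODE

VOUCHER_SPEC = <<~SPEC
  describe Voucher do
    it "redeems" do
    end
  end
SPEC

FILES = {
  "app/models/voucher.rb"       => VOUCHER_CODE,
  "spec/models/voucher_spec.rb" => VOUCHER_SPEC
}.freeze

# Count lines that are neither blank nor comment-only.
def loc(source)
  source.each_line.count do |line|
    stripped = line.strip
    !stripped.empty? && !stripped.start_with?("#")
  end
end

# Total LOC for app code and for tests, plus test lines per line of code.
def summarize(files)
  code_loc = files.select { |path, _| path.start_with?("app/") }
                  .values.sum { |src| loc(src) }
  test_loc = files.select { |path, _| path.start_with?("spec/", "test/") }
                  .values.sum { |src| loc(src) }
  ratio = code_loc.zero? ? nil : (test_loc.to_f / code_loc).round(2)
  { code_loc: code_loc, test_loc: test_loc, ratio: ratio }
end

summary = summarize(FILES)
puts "Code LOC: #{summary[:code_loc]}, Test LOC: #{summary[:test_loc]}, " \
     "Code to Test Ratio: 1:#{summary[:ratio]}"
```

A ratio well below 1 suggests thinly tested code, though line counts say nothing about how meaningful the tests are.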

Figure 9.5: A 3-by-5 inch (or A7 size) Class–Responsibility–Collaborator (CRC) card representing the Voucher class from Figure 9.4. The left column represents Voucher’s responsibilities—things it knows (instance variables) or does (instance methods). Since Ruby instance variables are always accessed through instance methods, we can determine responsibilities by searching the class file voucher.rb for instance methods and calls to attr_accessor. The right column represents Voucher’s collaborator classes; for Rails apps we can determine many of these by looking for has_many and belongs_to in voucher.rb.
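The search described in the caption can be roughed out with a few regular expressions. In this sketch the contents of voucher.rb are hypothetical (the real file is not shown here), as is the helper name crc_card; it also reports collaborators by association name (for example customer) rather than by class name, and it will miss anything defined through metaprogramming.

```ruby
# Hypothetical contents of voucher.rb, invented for this example.
VOUCHER_SOURCE = <<~SOURCE
  class Voucher < Item
    belongs_to :customer
    belongs_to :showdate
    attr_accessor :comments

    def reserve_for(showdate)
    end

    def cancel
    end
  end
SOURCE

# Collect a first draft of a CRC card from a class's source text.
def crc_card(source)
  # Instance methods: "def name", skipping class methods ("def self.name").
  responsibilities = source.scan(/^\s*def\s+(?!self\.)(\w+[?!=]?)/).flatten

  # Attributes declared with attr_accessor, attr_reader, or attr_writer.
  source.scan(/^\s*attr_(?:accessor|reader|writer)\s+(.+)$/).flatten.each do |args|
    responsibilities.concat(args.scan(/:(\w+)/).flatten)
  end

  # Associations name the collaborating models.
  collaborators = source.scan(
    /^\s*(?:has_many|has_one|belongs_to|has_and_belongs_to_many)\s+:(\w+)/
  ).flatten

  { responsibilities: responsibilities, collaborators: collaborators }
end

card = crc_card(VOUCHER_SOURCE)
puts "Responsibilities: #{card[:responsibilities].join(', ')}"
puts "Collaborators:    #{card[:collaborators].join(', ')}"
```

The output is only a starting point for the card: a human still has to group related methods into meaningful responsibilities and add collaborators that appear only inside method bodies.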

Self-Check 9.2.1. What are some reasons it is important to get the app running in development even if you don’t plan to make any code changes right away?

A few reasons include:

  1. For SaaS, the existing tests may need access to a test database, which may not be accessible in production.

  2. Part of your exploration might involve the use of an interactive debugger or other tools that could slow down execution, which would be disruptive on the live site.

  3. For part of your exploration you might want to modify data in the database, which you can’t do with live customer data.