Code Kata 4 – Data Munging

Here is the fourth of the CodeKata exercises.

Data munging–sounds kinda gross in some…indefinable way. But it’s actually some programmer’s
slang meaning “Mash Until No Good”, or
if you prefer recursion, it means “Mung Until No Good”.

This Kata is mostly in code with a few brief questions. I’ve chosen…wait for it…ASP.NET with C#. C#
because it’s what I’m currently best with, and ASP.NET because I find console apps dirty, and a full
Windows app to be overkill. I’ve made the source available on CodePlex. Okay, on with the Kata:

Part One: Weather Data

I initially thought the data was tab-delimited, but no, turns out it’s just fixed width. Fixed
width data from a 3rd party can be very tricky to write a reliable parser around, but since this is just
a learning exercise, I’m going to take a very naive, minimal approach. I created a “WeatherData” entity
class to hold the day, max temp, and min temp, and a method to calculate the difference. I don’t
need the other data, so why waste time trying to parse it? I created a WeatherReader class to read in
the text file and output a nice list of entity objects.

When parsing, first of all, I need to ignore the first 8 lines of the text file, because they are not data. Then, I need to
read in each line of data one at a time, and use a .Substring call to pull each column out. I put a wrapper
around the .Substring because I also wanted to Trim, remove invalid characters (notice the asterisks), etc.

When reading each line in a loop, I need a stopping place. Again, I took a naive approach and just stopped
reading once I reached the “mo” line.

Once I have a nice List of WeatherData, I just use some Linq to search for the day with minimum spread.
If you’ve not used Linq in C#, think of it like SQL for collections of objects, because, that’s pretty much
what it is.

Part Two: Soccer League Table

Pretty much the same story here. I created a SoccerTeam entity and a SoccerReader class to read in the text
file, parse it, and output a list. The data is a bit different in that it’s 5 lines at the top to ignore,
there’s a big dashed line right smack above Ipswich, and the data ends with a ‘</pre>’ instead of ‘mo’.

Part Three: DRY Fusion

So here’s the important part: factor out the common functionality of parts 1 and 2 in order to follow
the DRY principle.

This can be a tricky exercise because the parts are so deceptively simple. There’s obviously some of
the same things going on here, but I think an approach that’s too aggresive can lead to
some Liskov Substitution Principle issues.
A very aggresive approach might lead to one class to handle both the Weather and Soccer information, which one
might be able to pull off, but there are some downsides to that:

  1. If the weather.dat file format would change, any updated code could break the football.dat processing,
    and vice versa.
  2. If you want to add another file and entity, like grocery.dat, you will likely have a bunch of if/switch
    statements to update, and again, you risk breaking the other processors.
  3. Intuitively, Weather data and Soccer data are not really cohesive concepts. Even Baseball data and
    Soccer data would be a really big stretch, whereas English Premier League and Major League Soccer data could
    be much more cohesive.

So, instead, I took more of an incremental, systematic approach to refactoring.

I first created a base class: DatReader. I then made WeatherReader and SoccerReader inherit from DatReader.
There’s one method that was an obvious move to the base class: GetColumnValue, my SubString wrapper. The method
was 100% identical in both classes, so that’s an easy one. I also had a FileName string field that was
identical in both, so that was easy to move too.

Okay, that’s the fruit that’s close to the ground, anything else would require some actual refactoring:

  1. There
    were two constructors in each class, one of them worked directly with FileName, so I moved that and just changed
    the parameterless contructors in the subclasses to call “base” instead of “this”.
  2. The MapText…() and TheText() methods contain logic that is specific to the dat file, so while there is some
    overlap that is very tempting to refactor, I mostly left those alone, because in the real world, those naive
    implementations will probably change very, very often.
  3. However, the GetAll…() methods looked similar enough to factor out. In order to do that, the base class
    is going to need to be generic. Then I needed to add two abstract methods for TheText and MapTextToEntity, the latter
    of which will return my T generic type, and also be the new name of the MapText…() methods in the subclasses.
  4. WeatherReader should now inherit from DatReader<WeatherDay>, and likewise for SoccerReader.
  5. Now that MapTextToEntity is abstract, the GetAll…() methods don’t need to call anything but base class
    methods, so I refactored that to be just plain GetAll, returning an IList<T>.

Beautiful. It looks pretty darn elegant to me. Each subclass contains exclusively logic specific
to the dat file it’s trying to read. Any changes to the dat formats will only require changes in the subclasses,
and nowhere else.

Kata Questions

To what extent did the design decisions you made when writing the original
programs make it easier or harder to factor out common code?

Splitting the code into an Entity and EntityReader classes ala the Repository pattern I think made it easy
to identify what logic should go where. Additionally, splitting the functionality of the Reader classes into
small, readable methods made it much easier to see where refactoring would be a good idea, and more importantly,
where it would not.

Was the way you wrote the second program influenced by writing the first?

Yes. I practically copied and pasted my way through the whole thing. It took only minutes to write,
compared to the first, which took an hour or two.

Is factoring out as much common code as possible always a good thing? Did the
readability of the programs suffer because of this requirement? How about the maintainability?

1) Yes, except when it’s not. 2) A little. “GetAllSoccerGames” is much more usable than
“GetAll”, but I think the tradeoff is okay, because the class name is still “SoccerReader”.
3) Because I’ve dealt with fixed width text file parsing in the past, I was very cognizant of the
maintainability issues, so I think the way that I refactored it would be very easy to maintain, were this
a real application.

This a cool exercise. I feel like my solution is a very good one, and I think I went through this Kata in
exactly the way it was intended. If you disagree, please check out my CodeKata source code at CodePlex.

Copyright © All Rights Reserved · Green Hope Theme by Sivan & schiy · Proudly powered by WordPress