Sunday, October 5, 2014

Digging for Data: Unearthing the Locations of Farmers Markets in Boston - Part 1

Synthesizing the single uniform dataset that served as the source for my last post's Tableau dashboard took a lot of digging. I wouldn't call it "data mining," since there were no massive datasets involved with hidden gems of knowledge to be gleaned, but it was definitely a messy, dirty process. To start, I combed through the city of Boston's publicly available files, which can be found here (there's a lot of great content there that I intend to return to in many future posts).

There were five different files: two had the same name but different records, most had differing numbers of columns, and every one formatted its "Location" column differently, not to mention duplicate records scattered here and there across files. Here are some snapshots of the files on the way in:

Farmers Markets w/Benefits


Summer Farmers Markets


Farmers Markets w/Fish


Farmers Markets w/Fish(1)


Winter Farmers Markets


In the end I wanted a single dataset with as much information as possible about each market. To achieve this, I turned to Alteryx. Here is what the first half of my data flow looked like:


You can see that after inputting each of the five text files involved, I used the Auto Field tool to automatically convert as many fields as possible to the appropriate data type. Using the Text Input tool in conjunction with the Append Fields tool, I then appended a new binary 'With Fish?' column to both of the files representing markets that offer fish, to help differentiate them from the other markets later on. All of the files then underwent a parsing of some combination of their dates, times, and location fields.
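Outside of Alteryx, those first two steps, inferring field types and tagging the fish files, might be sketched in Python with pandas. The file contents and column names below are assumptions for illustration, not the actual layout of the city's files:

```python
import io
import pandas as pd

# A stand-in for one of the "Farmers Markets w/Fish" files (columns assumed).
fish_csv = io.StringIO(
    "Name,Day,Location\n"
    'Copley Square,Tuesday,"139 St. James Ave, Boston MA 02116"\n'
)

# pandas infers a type for each column on read, much like the Auto Field tool.
fish = pd.read_csv(fish_csv)

# Tag every record from this file, mirroring the Text Input + Append Fields step.
fish["With Fish?"] = "Yes"
```

The same tag-then-blend idea works for any per-file attribute you want to survive a later union.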

I won't go into too much detail on parsing each and every field, but I will say the ability to embed RegEx into the Alteryx workflow proved invaluable here. If you need to split on anything more than a single delimiter character, or want to extract specific bits of information from complex strings, RegEx is the only way to go.

Here is how the incoming Location column was structured in one of the files:



Here are the settings I entered into the RegEx tool:



And here are the columns I ended up with:



If you aren't already familiar with RegEx, I strongly recommend taking some time to learn it. I found this site particularly helpful when getting started (the only trick I used here that wasn't mentioned on that site was non-capturing groups).
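As a rough illustration of the kind of pattern involved (the location format and field names here are assumptions, since the real formats varied by file and aren't shown), here is how a RegEx with named groups and a non-capturing group might pull an address string apart in Python:

```python
import re

# Hypothetical location string; the real files each used their own format.
location = "Copley Square (139 St. James Ave., Boston, MA 02116)"

# Named groups capture the pieces we want to keep as columns. The (?:-\d{4})?
# non-capturing group matches an optional ZIP+4 suffix without creating
# a separate output column for it.
pattern = re.compile(
    r"^(?P<venue>[^(]+?)\s*"
    r"\((?P<street>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)\)$"
)

parts = pattern.match(location).groupdict()
```

In Alteryx, the same expression dropped into the RegEx tool in Parse mode produces one new column per named group.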

Once every record across all files had a uniform address data structure, I used a series of joins, selects, and unions to remove any duplicate records. Alteryx does have the Unique tool, which lets you choose the columns you consider keys and removes every record after the first it finds with a given key combination. My approach, however, let me discard duplicates while being selective about which columns I kept: when a duplicate was found, I took the most informative columns from each file and kept moving. You can see how my outer join was built in Alteryx below (the same pattern repeated three times, stacked above one another). Two files are passed to the Join tool, and then all the records from the Left, Join, and Right outputs are unioned back together using the Union tool. The Select tool in the middle removes any duplicate columns resulting from the successful Join.
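For readers who think in code rather than canvases, the Join/Select/Union pattern is essentially a full outer join followed by coalescing the duplicated columns. A minimal pandas sketch, with made-up sample data and an assumed "Name" key:

```python
import pandas as pd

# Two small stand-ins for the source files; "Name" acts as the join key.
summer = pd.DataFrame({
    "Name": ["Copley Square", "Dewey Square"],
    "Hours": ["11am-6pm", None],
})
fish = pd.DataFrame({
    "Name": ["Dewey Square"],
    "Hours": ["11:30am-6:30pm"],
    "With Fish?": ["Yes"],
})

# A full outer join keeps left-only, matched, and right-only records,
# like unioning the L, J, and R outputs of the Alteryx Join tool.
merged = summer.merge(fish, on="Name", how="outer", suffixes=("_l", "_r"))

# For columns both files carry, keep the most informative value from either
# side, mirroring the Select tool that drops the redundant copy.
merged["Hours"] = merged["Hours_l"].combine_first(merged["Hours_r"])
merged = merged.drop(columns=["Hours_l", "Hours_r"])
```

Each duplicated record survives as one row, with its blanks filled in from whichever file knew more.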


After the final union we have our single record set. I then cleaned up a few fields all at once with the Multi-Field Formula tool, mostly to remove unnecessary spaces or convert null values in my Yes/No columns into 'Maybe's. Here's what my single file looked like after the final Multi-Field Formula tool:
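The Multi-Field Formula step amounts to applying one rule across several columns at once. A sketch in pandas, with assumed column names (the 'Accepts SNAP?' column is hypothetical, suggested by the benefits file):

```python
import pandas as pd

# Combined record set with some messy values (columns assumed for illustration).
markets = pd.DataFrame({
    "Name": ["  Copley Square ", "Dewey Square"],
    "With Fish?": ["Yes", None],
    "Accepts SNAP?": [None, "No"],
})

# Trim stray whitespace from the text field.
markets["Name"] = markets["Name"].str.strip()

# Apply one rule to several Yes/No columns at once, like a Multi-Field
# Formula tool: nulls become 'Maybe'.
yes_no_cols = ["With Fish?", "Accepts SNAP?"]
markets[yes_no_cols] = markets[yes_no_cols].fillna("Maybe")
```

'Maybe' is more honest than 'No' here, since a market absent from one file may simply not have been surveyed for that attribute.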


You'll notice there are more tools on the canvas and that we are still missing arguably the most important fields: latitude and longitude. That's because we have yet to discuss the Alteryx CASS tools.

The Alteryx CASS tools allow you to manipulate and augment address data in a variety of ways. With functions like address parsing, completion, and geocoding, the CASS tools compare input data to a file of known addresses in the US and Canada. While you can only access these tools if you've purchased the Spatial package, which is a bit pricier than the basic Designer package, it's pretty easy to appreciate the ROI if you work with addresses on a daily basis. I'll go over the specifics of these tools and how they fit into this data flow in my next post. For now, just know that without them I could never have systematically finished incomplete addresses, corrected inaccurate addresses, or assigned accurate latitude and longitude coordinates to each market without resorting to custom programming.

Just to recap: I took five disparate files with different data structures and incomplete, even inaccurate, data, and blended them all together into a single source of cleanly addressed and geocoded truth using the workflow above. Before being introduced to Alteryx, I never would have thought that possible without involving a programming language like Python and a whole range of specialized libraries.

You can download all the files, both inputs and output, as well as the Alteryx module shown above using this link. Note that the module won't actually run to completion unless you have the appropriate data packages installed, but you should be able to run everything up to the CASS tools and at least see how those tools are configured with any edition of Alteryx.

Thanks for reading.


