In this project, I was tasked with creating a command-line application that scrapes a website and drills down at least one level into the data of that site. I decided to do something involving NASA. Over the years, the solutions to the unique problems they’ve had to solve have done so much to improve our lives on Earth, and I wanted to create an application that allowed users to explore an archive of their space missions.
I spent the first couple of days just researching and preparing. I needed to wrap my brain around the problem domain; before even opening a code editor, I had to figure out the problems that I needed to solve, best practices, and areas where new problems could emerge. In my experience, no plan survives first implementation, so I had to figure out which areas could potentially create the greatest delays.
I was provided two videos that proved very useful in this. One was a walk-through of creating the Ruby gem’s configuration, environment, and user-facing elements. This was incredibly helpful, as that’s one aspect of the process I didn’t want to spend much time troubleshooting. The other video was about common antipatterns — in this context, poor data-structure design choices. The video showed an application that worked, but involved zipping disparate arrays together; even with my limited coding experience, I could see it would be very cumbersome to debug and/or maintain. Seeing examples of how *not* to build it was instrumental going forward.
Lastly, I had to find a site that was scrape-able. What I mean by that is it has a repeated & uniform structure, and its HTML elements have consistent CSS selectors (classes and IDs) to target. Nokogiri (the web-scraping Ruby gem) uses those CSS selectors to build the data structures I can use in my program. My first attempts were Wikipedia and the NASA Space Science Data Coordinated Archive; while they both have incredibly useful & descriptive data, neither site has a structure that I could use. I eventually landed on worldhistoryproject.org, which does use a web-scraper-friendly structure.
The NASA section of worldhistoryproject.org has a couple of potential downsides: 1) it’s not very comprehensive, and 2) it’s spread across two pages (which could induce overly-complicated code). The lack of a complete encyclopedic collection wasn’t really an issue in my use case, as World History Project only covers the major media events that the average person would be most interested in. However, it is a noteworthy limitation if your particular project demands a more complete archive.
As for the information being spread across two pages, I initially created two scraping methods, each referencing a different URL. This felt overly cumbersome to me, and went against the UNIX philosophy of “each entity only does one task.” So I modified the CLI file that handles the logic for all the user interactions so that it passes the appropriate URL in to a single scraping method.
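A sketch of that refactor, with illustrative names (the class, method names, and page paths are my assumptions, not the gem’s actual API): the CLI owns the mapping from menu choice to URL, and one scraping method accepts whatever URL it’s handed.

```ruby
# The CLI layer decides *which* page; the scraper only knows *how* to
# scrape the page it is given. (Paths after the domain are illustrative.)
class Scraper
  URLS = {
    1 => "https://worldhistoryproject.org/page-one",
    2 => "https://worldhistoryproject.org/page-two"
  }.freeze

  def self.url_for(menu_choice)
    URLS.fetch(menu_choice)
  end

  # One method, one task: scrape the URL passed in.
  # (Fetching and Nokogiri parsing omitted in this sketch.)
  def self.scrape_events(url)
    "scraping #{url}"
  end
end

puts Scraper.scrape_events(Scraper.url_for(2))
```

The payoff is that adding a third page later would mean one new hash entry, not a third near-identical method.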
I also got an idea for this app from the Ruby gem walk-through video I mentioned earlier. In it, the presenter showed how to open a URL in the default browser. This was the last piece I added, as I wanted to ensure that I was getting the right data in all the right places beforehand. Since Linux is the only operating system I run on my personal computers, that was the only OS I could test this functionality on.
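Here’s a hedged sketch of how browser-opening can be done per-OS. On Linux, `xdg-open` is the standard launcher; the macOS and Windows branches are untested assumptions on my part, which is exactly the caveat above. The sketch only builds the command rather than running it:

```ruby
# Build the OS-appropriate "open this URL in the default browser"
# command. Only the Linux branch is something I could test.
def open_command(url)
  case RbConfig::CONFIG['host_os']
  when /linux/       then ["xdg-open", url]  # tested on Linux
  when /darwin/      then ["open", url]      # untested assumption (macOS)
  when /mswin|mingw/ then ["start", url]     # untested assumption (Windows)
  else
    raise "don't know how to open a browser on this OS"
  end
end

cmd = open_command("https://worldhistoryproject.org")
puts cmd.join(" ")
# To actually launch the browser, you would run: system(*cmd)
```

Keeping the command-building separate from `system` also makes this logic testable without popping open a browser window every run.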
The first surprise problem I had to solve was the shebang on the user-facing executable, and this was due more to hubris on my part. The shebang is the first line of an executable file that tells the operating system which interpreter to use. Instead of looking at the console file that Bundler created for me and copying that shebang, my thought process at the time was “oh, I know how to find that in Linux,” so I typed “whereis ruby,” which showed where Ruby was installed from my distribution’s repos… but the environment I set up uses RVM (Ruby Version Manager), which installed everything into a different directory.
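The portable fix (and, if I recall correctly, what Bundler-generated executables use) is to not hard-code an interpreter path at all:

```ruby
#!/usr/bin/env ruby
# `env` looks up `ruby` on the PATH, which is where RVM puts its
# currently selected interpreter -- instead of a hard-coded distro
# path like /usr/bin/ruby, which may be a different Ruby entirely.
puts RUBY_VERSION
```

With that shebang, the executable runs under whichever Ruby the user’s shell would run, RVM-managed or not.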
As I incrementally tested my app’s functionality, I noticed that the event list it printed would still include my previous selection when I returned to the main menu. The fix wasn’t difficult: delete the entire list of events before instantiating a new list from the URL for the year range the user selected. So if a user chose the range 1957–1983, then went back to the main menu & chose 1983–2012, they would only see events for the years they just chose.
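A sketch of that fix (class and method names are illustrative, not the gem’s actual code): events register themselves in a class-level array, and the CLI clears it before scraping a new year range.

```ruby
# Each Event adds itself to a class-level collection on creation;
# Event.clear empties it so a new menu selection starts fresh.
class Event
  @@all = []

  attr_reader :title

  def initialize(title)
    @title = title
    @@all << self
  end

  def self.all
    @@all
  end

  def self.clear
    @@all.clear
  end
end

Event.new("Sputnik 1 Launched")   # user picks 1957-1983
Event.clear                       # user returns to the main menu
Event.new("Hubble Launched")      # user picks 1983-2012
puts Event.all.map(&:title)       # only the new range's events remain
```

Without the `clear`, the second scrape would just append to the first, which is exactly the duplicated-list symptom I was seeing.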
I had a similar challenge when setting the URL for each event instance. On their site, the link to each event’s article only has the relative path (it doesn’t include the domain name), so I had to prepend the domain before instantiating the object. I set a variable to the domain name before iterating through the list of events, and appended each event’s relative path to it inside the each block. Because the append mutated that one shared string, each relative path got tacked onto the previous event’s full URL, so only the first event had the correct URL. This was resolved by moving the variable declaration inside the each block, so each iteration starts from a fresh copy of the domain.
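The bug distills down to Ruby’s `<<` mutating the string in place (the paths here are illustrative):

```ruby
paths = ["/events/sputnik", "/events/apollo-11"]

# Buggy version: one string declared outside the block, mutated by <<
# every iteration, so each path piles onto the last full URL.
base = "https://worldhistoryproject.org"
buggy = paths.map { |path| base << path }
# base is now "https://worldhistoryproject.org/events/sputnik/events/apollo-11"

# Fixed version: build a fresh string inside the block each time.
fixed = paths.map { |path| "https://worldhistoryproject.org" + path }
puts fixed
```

`String#+` returns a new string while `String#<<` modifies the receiver — a distinction that’s easy to miss until a list of URLs comes out looking like a train wreck.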
This was a very fun exercise, and I can’t thank the creators of the Ruby gem Pry enough. It was instrumental in ensuring that I was pulling the information that I needed from the site. The source code is available on GitHub at: https://github.com/kerneltux0/NASA_Timeline_Scraper and I’ve licensed it as open source under the GNU General Public License. I think the GPL is a major reason behind the growth of the open-source ecosystem (and the Linux ecosystem especially), and this is a good way to contribute to that community.