Some notes from the exercism contributor repo :)

Pagination

These are the modifications we made to deal with pagination:

module GithubHelper
  def exercism_contributors
    # cache the result so we don't hit the GitHub API (and its rate limit) on every request
    Rails.cache.fetch(:contributors, expires_in: 1.hour) do
      github = Github.new do |config|
        config.client_id     = ENV["GH_BASIC_CLIENT_ID"]
        config.client_secret = ENV["GH_BASIC_SECRET_ID"]
      end

      # empty repos have no contributors, so skip them
      exercism_repos = github.repos.list(user: "exercism").to_a.delete_if { |x| x[:size] == 0 }

      exercism_repos.map do |r|
        1.upto(Float::INFINITY).with_object([]) do |pagenum, contributors|
          page = github.repos.list_contributors('exercism', r.name, page: pagenum)
          page.each { |c| contributors << Hash["repo", r, "contributor", c] }

          # identify when to stop paginating
          link = page.response.headers['link']
          last_pagenum = if link
            last_link = link.split(',').grep(/rel="last"/).first
            last_link ||= "page=#{pagenum}" # on the last page, it apparently doesn't give us the pagenum -.^
            last_link[/page=(\d+)/, 1].to_i
          else
            pagenum
          end

          break contributors if pagenum == last_pagenum
        end
      end
    end
  end
  # ...
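
The stopping condition works off GitHub's Link header. Here's a minimal, self-contained sketch of that parsing (the header value below is made up, but it has the shape GitHub returns for paginated endpoints):

# A made-up Link header, shaped like the one GitHub sends back
link = '<https://api.github.com/repositories/123/contributors?page=2>; rel="next", ' \
       '<https://api.github.com/repositories/123/contributors?page=5>; rel="last"'

last_link = link.split(',').grep(/rel="last"/).first
last_link[/page=(\d+)/, 1].to_i # => 5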

API keys

Add environment_variables.yml to .gitignore, and delete it from the repo. I typically do this like so:

# Delete the file from the repo (but not from the file system)
$ git rm --cached environment_variables.yml

# Ignore the file
$ echo environment_variables.yml >> .gitignore
$ git add .gitignore

# We always check our work with git :P
$ git status
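
For completeness, here's one way the ignored file could get loaded into ENV at boot. This is a minimal sketch; the file path and the initializer name are assumptions, not necessarily how the exercism app does it:

# config/initializers/environment_variables.rb (hypothetical -- adjust the path
# to wherever the ignored YAML file actually lives)
env_file = Rails.root.join("config", "environment_variables.yml")

if File.exist?(env_file)
  YAML.load_file(env_file).each do |key, value|
    ENV[key] ||= value.to_s
  end
end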

Hashes graduating to Objects!

When we are making hashes that all have the same set of keys, an easier way to work with them is to make an object :)

Here's an example of what that might look like:

# these are defined in github_api
Repo        = Struct.new :repo_data
Contributor = Struct.new :contributor_data

# these come back from queries
repo        = Repo.new 'exercism.io'
contributor = Contributor.new 'kytrinyx'

# this is what we're handing to the views
ContributorData = Struct.new :repo, :contributor

Hash['repo', repo, 'contributor', contributor]
# => {"repo"=>#<struct Repo repo_data="exercism.io">,
#     "contributor"=>#<struct Contributor contributor_data="kytrinyx">}

cont = ContributorData.new repo, contributor
# => #<struct ContributorData
#     repo=#<struct Repo repo_data="exercism.io">,
#     contributor=#<struct Contributor contributor_data="kytrinyx">>

cont.repo.repo_data               # => "exercism.io"
cont.contributor.contributor_data # => "kytrinyx"

Once that becomes insufficient, we can just change it to be a normal class instead of a struct:

class ContributorData
  attr_accessor :repo, :contributor
  def initialize(repo, contributor)
    @repo, @contributor = repo, contributor
  end
end

A solid foundation for future changes!

From here, we can begin pushing all the data that we need into these objects. Then we can say things like data.contributions.count instead of data.contributor['contributions'].count and so forth.
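
As a minimal sketch of what that could look like (the contributions method and the shape of the raw data here are assumptions for illustration, not the actual API payload):

class ContributorData
  attr_accessor :repo, :contributor

  def initialize(repo, contributor)
    @repo, @contributor = repo, contributor
  end

  # callers ask the object; they never dig into the raw API hash themselves
  def contributions
    contributor['contributions']
  end
end

data = ContributorData.new('exercism.io', 'contributions' => ['fix typo', 'add docs'])
data.contributions.count # => 2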

This means we depend on objects we control. Which turns out to be a huuuuuuuuuge win!

For example, imagine finding some other endpoint that has all the data we need... buuuuut it's in a different format >.< If we depend on the structure of the data that comes back from this endpoint, we'll have to redo everything!

That's what people mean when they talk about "coupling" and "decoupling". When it's coupled, changes in the API break all our code. When it's decoupled, we only have to fix the boundaries where they enter our system.

Take a look at these speakers for the iPhone 4:

[image: iPhone 4 speakers]

What a nice setup, we think, as we buy them. Then the iPhone 5 comes out. And we have to spend another couple hundred dollars buying new speakers, because the plug they use only works with our old phone >.< If they'd decoupled it, we could just swap out the piece that talks to the phone. The speakers themselves are still perfectly good; they don't care where the music comes from!

In our case, if we move the data from whatever shape it arrives in into our own objects, then we can just instantiate those objects from the new source, and all the code still works just fine! What's more, we can keep the code for each of the endpoints we find. Then, if we ever want to switch back, we've already got the code to extract the data, and our views and algorithms don't care which endpoint we got it from, so they're already good to go!

Or say we decided to keep a backup of the data somewhere. That way, if the GitHub API goes down, we can load the data from the backup. It'll be a bit stale, but still nearly as good as the fresh information. Again, nothing that depends on the data cares that we got it from our backup, stored in a JSON file or a database or wherever, because in the end, we hand it the same kind of objects!

And what are we going to do about this rate limiting? At some point, it could become important to query the API in a background process, and maybe switch over to storing the data in our database. Perhaps we pull it out with ActiveRecord. Do we have to update all our code to know what the ActiveRecord data looks like? Nope: the relationships are now based on our database schema instead of the API format, but we just make a single class to pull the data from the database and instantiate our objects. Mimosas all around :P
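
A minimal sketch of that single class might look like this (the ContributorRecord model, its association, and its column names are made up for illustration, not the actual schema):

# Hypothetical ActiveRecord-backed source; the model and column names are assumptions.
class DatabaseContributorSource
  def contributors
    ContributorRecord.includes(:repo_record).map do |record|
      ContributorData.new(
        Repo.new(record.repo_record.name),
        Contributor.new(record.login)
      )
    end
  end
end

The views and algorithms receive the same ContributorData objects as before, so nothing downstream has to know the data now comes from the database.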

Four completely reasonable and somewhat probable sources of data: two endpoints whose results come back in different structures, a backup file, an asynchronous worker saving to a database. We can handle them all. Heck, we can even handle them all at the same time ^_^

Easy testing and the 1 billion percent speed increase

1 billion percent? Not hyperbole!

Let's say we want to build the open source report card. It probably requires some crunching of the data. With plain objects, we can just instantiate a few of them and run an algorithm against them, without worrying about rate limiting or having to pay the cost of web requests on every run.

In our case, there are a lot of web requests (roughly 30 repos, 30 contributors per page, times however many pages each repo has). It's the kind of delay that is very painful. It quickly begins taking seconds to get the data, and the longer the repos are out there, the more data we have and the longer it takes. I've seen (and written >.<) API consumption that spans minutes.

But how expensive is it to instantiate some Ruby objects? It's super cheap!

Contributor   = Struct.new :repo, :contributor
num_objects   = 1_000_000
start_time    = Time.now
num_objects.times { Contributor.new 'repodata', 'contributordata' }
total_seconds = Time.now - start_time
printf "Time to make %d objects: %0.2f seconds (%0.2f objects per second)!\n",
       num_objects,
       total_seconds,
       num_objects / total_seconds

# >> Time to make 1000000 objects: 0.42 seconds (2387831.61 objects per second)!

Once we depend on simple Ruby objects that we can create from any source, we can trivially instantiate them for our tests. It becomes easy to write tests that hit all the situations, even the weird edge cases. We can run them without loading Rails or the database, so the results come back as soon as we press "enter" on the command that runs the suite.
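
Here's a minimal sketch of what such a test could look like (the test name and the tiny leaderboard calculation are made up for illustration -- the point is that nothing here touches Rails, the database, or the GitHub API):

require 'minitest/autorun'

ContributorData = Struct.new :repo, :contributor

class LeaderboardTest < Minitest::Test
  def test_counts_contributors_per_repo
    data = [
      ContributorData.new('exercism.io',     'kytrinyx'),
      ContributorData.new('exercism.io',     'another_contributor'),
      ContributorData.new('some-other-repo', 'kytrinyx'),
    ]

    counts = data.group_by(&:repo).map { |repo, rows| [repo, rows.size] }.to_h

    assert_equal({ 'exercism.io' => 2, 'some-other-repo' => 1 }, counts)
  end
end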

TDD proponents often say things like "when testing is hard, it usually implies a design issue". It took me a really long time to figure out what they meant and what to do about it >.< So I like to state it as the contrapositive: if our design is decoupled, testing becomes fast and easy!

And the great thing is, once we learn to recognize the difficulties that things like coupling place on our development, we can begin to address them in ways that fundamentally make the code more resilient and flexible all around!