This repository has been archived by the owner on Feb 23, 2018. It is now read-only.

Github Move

jsuereth edited this page Aug 22, 2011 · 7 revisions

Intro

We intend to move the canonical scala repository from svn / epfl.ch to git / github.com.

Background info for understanding preconditions:

The scala build process requires a bootstrap compiler, called "starr" for "stable reference Iforrrrrget", to get the ball rolling. When certain fundamental compiler features change, a starr which is aware of the change must be present or the build cannot be bootstrapped. So over the years many starrs have been checked into the repository. The "starr files", at present, are:

  • lib/scala-library.jar
  • lib/scala-compiler.jar
  • lib/scala-library-src.jar

With each new starr, the total size of the repository jumps by approximately the size of those jars, currently 20MB. In a git mirror which preserves all the svn history:

% git log --format=oneline  -- lib/scala-library.jar |wc -l
       140
% du -hs scala-full
    1.0G	scala-full

Without historical starrs the repository will be a fraction of that size. Since rewriting the git history renders all existing clones incompatible, we want to do it as few times as possible. (It has already been done once -- github.com/paulp/scala-full is the complete one; github.com/scala/scala purged many starrs but has since reaccumulated a dozen or so.) We will do it only once more, and not until we can make it unnecessary to ever do it again, as described herein.

Automatic Starr Downloading

Every revision of scala which acts as starr in the history (i.e. the 140 revisions indicated above) needs to be published as versioned maven/ivy artifacts so they can be expressed as standard dependencies. They can be named or versioned to remain distinct from the regular scala-library and scala-compiler, and probably should be: they will only be used to bootstrap compiler builds.

In order for people to be able to check out older versions of scala and build them, we need to travel back in time to those revisions and inject on-demand starr downloading into the build process. The exact mechanism is yet to be decided, but it should be as simple as possible and require the minimum possible changes to existing files.

Proposal: store a metadata file which is changed whenever a new starr is introduced.

# starr.properties
repo=http://www.typesafe.com/scala-starr/
starr=r24749
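
As a rough illustration of how a build step might consume this file, here is a minimal shell sketch. The helper names `prop` and `starr_url` are hypothetical. The remote layout is an assumption: the proposal only gives a `repo` base URL and a `starr` version, so the sketch guesses that artifacts sit directly under `repo` and are named like the local cache files.

```shell
#!/bin/sh
# Read one key from a simple key=value properties file.
# Comment lines (starting with '#') never match, and the value is
# everything after the first '='.
prop() {
  sed -n "s/^$2=//p" "$1" | head -n 1
}

# Print the download URL for one starr artifact ("library", "compiler", ...).
# ASSUMPTION: artifacts live directly under $repo and are named
# scala-<artifact>-<starr>.jar. The proposal does not fix the remote layout.
starr_url() {
  printf '%s%s\n' "$(prop "$1" repo)" "scala-$2-$(prop "$1" starr).jar"
}
```

For example, `starr_url starr.properties library` would print the URL of the library jar for whichever starr the checked-out revision names.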

Create an xml file at the beginning of time with the following. (Wherever you see "library" think "library, compiler, and optionally src")

  // starr-build.xml
  // pretend it's xml
  if (lib/scala-library.jar exists and has the right version) proceed
  else { 
    if (lib/starr/scala-library-${starr}.jar does not exist) { download it }
    rm -f lib/scala-library.jar ;
    ln -s starr/scala-library-${starr}.jar lib/scala-library.jar   // link target is resolved relative to lib/
    proceed
  }

(lib/starr is an svn-and-git-ignored local cache of starrs. This could use the ivy or maven cache if it matters.)
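
The same logic can be sketched as a shell function, which may be easier to reason about than the eventual ant XML. This is only an illustration under stated assumptions: `ensure_starr` and `fetch` are hypothetical names, "has the right version" is approximated by checking where the symlink points, and the remote file name is assumed to mirror the cache name. It is meant to be run from the root of the checkout.

```shell
#!/bin/sh
# Download URL $1 to path $2. Kept separate so it can be swapped out
# (for instance to use an ivy or maven cache instead).
fetch() {
  curl -fsSL -o "$2" "$1"
}

# Usage: ensure_starr <repo-url> <starr-version> <artifact>
# Makes lib/scala-<artifact>.jar a symlink to the cached starr jar.
# ASSUMPTIONS (not from the proposal): the version check is done by
# inspecting the symlink target, and the remote name equals the cache name.
ensure_starr() {
  link="lib/scala-$3.jar"
  name="scala-$3-$2.jar"
  cached="lib/starr/$name"

  # Already linked to the requested starr: proceed.
  if [ -L "$link" ] && [ "$(readlink "$link")" = "starr/$name" ] && [ -f "$cached" ]; then
    return 0
  fi

  mkdir -p lib/starr
  if [ ! -f "$cached" ]; then
    # Download to a temporary name so an interrupted transfer is not
    # mistaken for a complete jar on the next run.
    fetch "$1$name" "$cached.part" || return 1
    mv "$cached.part" "$cached"
  fi

  rm -f "$link"
  # The target is resolved relative to lib/, where the link lives.
  ln -s "starr/$name" "$link"
}
```

Calling it once per artifact (library, compiler, and optionally src) with the values from starr.properties would cover the three starr files.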

Add a line to scala's build.xml which runs starr-build.xml before anything else.

** To Be Determined **

The starr jars one finds in svn are binary blobs. They might have been built by anyone and contain anything. Is it more important to be faithful to the history (i.e. we could put those exact blobs up as maven artifacts) or to be consistent even if it risks changing the outcome of a build? In the latter case we would determine what the version of starr ought to be at that point in time and rebuild all starrs in exactly the same way.

I vote for the second, but I can understand preferring the first.

The Rewrite

Once we have the above in hand, we can undertake the rewriting.

  1. Start with a clean git-svn mirror containing all history and all files.
  2. Using git filter-branch or equivalent, inject starr-build.xml, starr.properties, and the build.xml hook at the beginning of time.
  3. Then rewrite every commit which touches "starr files" not to check in binaries.
  4. Instead, those commits update starr.properties with that version.
  5. Jump through the git hoops to really and truly purge the binaries.
  6. Push it to scala/scala and flip the switch.
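
Steps 3 and 5 could look roughly like the following. This is a sketch, not the actual migration script: the function names are hypothetical, it does not cover injecting files at the beginning of time (step 2) or updating starr.properties per commit (step 4), and it should only ever be run on a throwaway clone. The purge sequence follows the checklist in the git-filter-branch manual for shrinking a repository.

```shell
#!/bin/sh
# The "starr files" listed earlier on this page.
STARR_FILES="lib/scala-library.jar lib/scala-compiler.jar lib/scala-library-src.jar"

# Step 3: drop the starr jars from the index of every commit on every ref.
# --tag-name-filter cat moves existing tags onto the rewritten commits.
# FILTER_BRANCH_SQUELCH_WARNING only silences the advisory (and its delay)
# that newer git versions print before running filter-branch.
strip_starrs() {
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --index-filter "git rm -q --cached --ignore-unmatch $STARR_FILES" \
    --tag-name-filter cat -- --all
}

# Step 5: delete the backup refs and reflog entries that still reach the
# old objects, then repack so the jars are really removed from .git.
purge_old_objects() {
  git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done
  git reflog expire --expire=now --all
  git gc --quiet --prune=now --aggressive
}
```

After both functions have run, `git log -- lib/scala-library.jar` should come back empty and `du -hs .git` should show the reduced size.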

As soon as we have the compiler being built with sbt 0.10, we can express starr as a simple dependency of the locker project. But history is important and a git repo which is 20x bigger than necessary is unacceptable, so we need to integrate it into ant even if we moved to sbt today.

What did I miss?

Addenda to integrate:

djspiewak: Might be worth adding a filter-branch pass to do some cosmetic cleanup. Removing the gratuitous git-svn meta additions to commit messages is an obvious one, but a quick glance at the Git mirror hints that you haven't been using an authors file with git-svn. That's something which I think is important if we want the history to be coherent between the old SVN-imported commits and the new Git ones (e.g. if Martin made a commit against the Git repo, it would show up as "Martin Odersky odersky@epfl.iforget.theaddress.ch", but when looking at a historical commit he made, the attribution would be simply "odersky"). This can be done either with filter-branch (see: http://stackoverflow.com/questions/392332/retroactively-correct-authors-with-git-svn) or by running a clean import from SVN.
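
A minimal sketch of the two cleanups suggested here, with hypothetical helper names. It assumes the authors file uses the git-svn format (`login = Full Name <email>`) and that the git-svn metadata is the usual `git-svn-id:` trailer line. The blank line that preceded the trailer is left in place.

```shell
#!/bin/sh
# Message filter: strip the metadata trailer that git-svn appends to each
# commit message. Reads the message on stdin, writes it to stdout.
strip_git_svn_id() {
  sed -e '/^git-svn-id:/d'
}

# Usage: lookup_author <authors-file> <svn-login>
# Prints "Full Name <email>" for the login, or nothing if it is unknown.
lookup_author() {
  sed -n "s/^$2 *= *//p" "$1" | head -n 1
}
```

The first helper corresponds to what one would pass to `git filter-branch --msg-filter`, e.g. `--msg-filter 'sed -e "/^git-svn-id:/d"'`. The second is the kind of lookup an `--env-filter` could use to set `GIT_AUTHOR_NAME` and `GIT_AUTHOR_EMAIL` from the bare svn login.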

retronym: Here's some additional documentation from GitHub on that matter. If the email mapped by the authors file matches that of a registered user on GitHub, the commits are linked to that user.

paulp again: Starrs have to be forever if we want the historical mechanism to work, so they are "releases" according to some reasonable test, but they can't have any "release process", i.e. we have to be able to build them and ship them immediately because sometimes it is too difficult to bootstrap unless you can check in a new starr and new code together.

So living in that weird space, I think we keep versioning them according to the trunk revision, e.g. starr-library-r12345.jar.

[Delete discussion of svn revisions in favor of] Tags: the rewritten history should be tagged everywhere it's meaningful to do so. With those in place we can use git-describe to generate more digestible references to individual commits than shas.
