The code behind GitHub's new Code Navigation

The code behind GitHub's new Code Navigation

GitHub recently announced two new code navigation features:

  1. Jump to definition – quickly view the source for any given method.
  2. Find all references – Identify all references of a function within the same repo.

I have been waiting a long time for these features to surface, and best of all, the code behind all this is open source!

From the top-level, they're using a Haskell library called Semantic.  Semantic can parse, analyze, and compare source code.  As of the last update to this post, it supports Ruby, JavaScript, TypeScript, Python, Go, PHP, Java, JSON, JSX, Haskell, and Markdown.

For a full list of examples, check out the Semantic usage docs.  It comes with a great CLI that allows you to:

Generate parse trees

$ semantic parse test.A.py
(Statements
  (Annotation
    (Function
      (Identifier)
      (Identifier)
      (Return
        (Identifier)))
    (Empty))
  (Call
    (Identifier)
    (Call
      (Identifier)
      (TextElement)
      (Empty))
    (Empty)))
Generate an abstract syntax tree (AST)

Generate syntax aware diffs

$ semantic diff test.A.py test.B.py
(Statements
  (Annotation
    (Function
    { (Identifier)
    ->(Identifier) }
      (Identifier)
      (Return
        (Identifier)))
    (Empty))
  (Call
    (Identifier)
    (Call
    { (Identifier)
    ->(Identifier) }
      (TextElement)
      (Empty))
    (Empty)))

Display import call graphs

semantic graph main.py | dot -Tsvg > main.html && open main.html

Display call graphs

semantic graph --calls main.py | dot -Tsvg > main.html && open main.html

Making Sense of Your Code

The code parsing is done with the tree-sitter library which appears to be C with a handful of language bindings (the cli is in Rust!).

[Tree-sitter] can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited

Languages are supported by writing language-specific parsers.  These parsers primarily consist of two parts:

  1. Grammar –  contains various rules for how the language is formatted
  2. Lexical Analysis – patterning individual character groups (for lexing).

Example Grammar for parsing simple ruby methods from tree-sitter-ruby:

{
  "name": "ruby",
  "word": "identifier",
  "rules": {
    "method": {
      "type": "SEQ",
      "members": [
        {
          "type": "STRING",
          "value": "def"
        },
        {
          "type": "SYMBOL",
          "name": "_method_rest"
        }
      ]
    }
  }
}
Ref: https://github.com/tree-sitter/tree-sitter-ruby/blob/master/src/grammar.json

Tree-sitter has an awesome Playground where you can explore the parse trees it generates for any given code.

When you feed code to the Semantic library, it generates parse trees using tree-sitter, and then uses the language's grammar (e.g. from tree-sitter-ruby) to parse the parse trees and map AST nodes  (many of which are already available from the library itself for: literals, expressions, statements, and types).

While they hope to drop this step in the future, once the AST nodes are mapped, a second round of parsing is done to wrap them into data structures the library can comprehend (for more on this, see the "Assignment" documentation).

There are quite a few more things that happen to analyze and compare source code, but I'll leave the explanation it to their great documentation on program analysis.

Thank you GitHub!  Keep up the solid work :-)