All data from the semantic analysis of a source file is dumped out to
JSON. For a file with path ${path}
from the repository root, the
analysis data is stored at ${index}/${tree_name}/analysis/${path}
.
The analysis data is broken into records. Each record corresponds to an identifier in the original source code. Each line of the file contains a record. (Technically the file is not JSON itself, but a series of JSON objects delimited by line breaks.)
Analysis records are currently generated from:
scripts/js-analyze.js
for the JavaScript analysis.scripts/idl-analyze.py
for IDL files.clang-plugin/MozsearchIndexer.cpp
for C++ files.tools/src/bin/rust-indexer.rs
for Rust files.
Analysis records may also be downloaded from Taskcluster for
mozilla-central builds (this can be viewed as an optimization to
avoid rebuilding mozilla-central as part of indexing). In fact,
records may be downloaded for multiple platforms, in which case
the scripts/merge-analyses.py
script is used to combine the
different analyses from different platforms for a given source
file.
Analysis records are consumed from Rust code in
tools/src/analysis.rs
.
There are two kinds of records: sources and targets. Each source record generates one or more context menu items when the user clicks on an identifier. Each target record corresponds to a place in the source code where things are to be found. For the most part, there will be one source record and one target record for a given identifier. However, in the case of C++ inheritance, there may be one source record and multiple target records.
Here is an example analysis. First, some JavaScript code:
let x = {a: 1};
dump(x.a);
Then its analysis:
{"loc":"1:4-5","source":1,"syntax":"def,prop","pretty":"property x","sym":"#x"}
{"loc":"1:4","target":1,"kind":"def","pretty":"x","sym":"#x"}
{"loc":"1:9-10","source":1,"syntax":"def,prop","pretty":"property a","sym":"#a"}
{"loc":"1:9","target":1,"kind":"def","pretty":"a","sym":"#a"}
{"loc":"1:9-10","source":1,"syntax":"def,prop","pretty":"property x.a","sym":"x#a"}
{"loc":"1:9","target":1,"kind":"def","pretty":"x.a","sym":"x#a"}
{"loc":"2:0-4","source":1,"syntax":"use,prop","pretty":"property dump","sym":"#dump"}
{"loc":"2:0","target":1,"kind":"use","pretty":"dump","sym":"#dump"}
{"loc":"2:5-6","source":1,"syntax":"use,prop","pretty":"property x","sym":"#x"}
{"loc":"2:5","target":1,"kind":"use","pretty":"x","sym":"#x"}
{"loc":"2:7-8","source":1,"syntax":"use,prop","pretty":"property a","sym":"#a"}
{"loc":"2:7","target":1,"kind":"use","pretty":"a","sym":"#a"}
{"loc":"2:7-8","source":1,"syntax":"use,prop","pretty":"property x.a","sym":"x#a"}
{"loc":"2:7","target":1,"kind":"use","pretty":"x.a","sym":"x#a"}
In JavaScript, there is one source and one target for each
identifier. Both kinds of records include a location, of the form
${lineno}:${colno}
for targets and
${lineno}:${start_colno}-${end_colno}
for sources.
Both kinds of nodes also contain a sym
property, which is how
sources are linked to targets. Source nodes are allowed to contain a
comma-delimited list of symbols. The results for all of these symbols
are combined in search results. Target records can only contain a
single symbol.
In JavaScript, the symbol can take three forms:
-
${file_index}-${var_index}
. All local variables are assigned a number within the file. In addition, each JS file has a unique number assigned to it. Combining them, we can generate a unique symbol for every variable in the repository. Note that top-level variables are considered properties of the global object, so they use the next form. -
#prop
. A propertyprop
of an object is given the symbol#prop
. Properties used in different files will get the same symbol, since they might be for the same object. -
object#prop
. In some cases, mozsearch is able to infer a static name for the object in which a property lives. It can do this for object literals (i.e.,let x = {prop: ...};
) as well as cases likeFoo.method = ...;
. These names aren't always useful, but they often are. In all such cases, mozsearch also generates analysis records for the bare property names (#prop
).
There are a variety of symbol names for C++ code:
-
For functions, the mangled name of the function is used. This allows mozsearch to distinguish overloads.
-
For local variables, the symbol is
V_${variable_location_hash}_${variable_name_hash}
, where the location is the hash of the filename and line where the variable is declared. -
For anonymous types, the symbol is
T_${type_location_hash}
. -
For named types, the symbol is
T_${qualified_type_name}
. -
For anonymous namespaces, the symbol is
NS_${namespace_location_hash}
. -
For named namespaces, the symbol is
NS_${qualified_namespace_name}
. -
For fields of structs, unions, and classes, the symbol is
F_<${record_symbol}>_${field_index}
, where${record_symbol}
uses the recursively defined symbol name and${field_index}
is the index of the field starting from 0. -
For enumeration constants, the symbol is
F_<${enum_symbol}>_${constant_name}
. -
For files (for includes), the symbol prefix is
FILE_
and the following normalization rules apply. SeemangleFile
inMozsearchIndexer.cpp
for more rationale,- Any character that is not alphanumeric,
_
or/
gets uppercase hex escaped with a prefix of@
. So "path/foo.h" becomesFILE_path/foo@2Eh
. - Generated files get an extra prefix of
${PLATFORM}@
is addition to the__GENERATED__/
path prefix. So for the example of "path/foo.h" if it were generated on windows, the symbol would beFILE_windows@__GENERATED__/path/foo@2Eh
.- The platform prefix is reusing the logic from
mangleLocation
which was intended to avoid hash collisions for things that aren't equivalent, but this probably could be re-thought to factor in types and whether merge-analyses ends up merging things. Right now it creates diverging symbols that potentially don't need to diverge.
- The platform prefix is reusing the logic from
- Any character that is not alphanumeric,
Note that in some cases, the symbol for a C++ identifier might vary from one
platform to another. For example, a function signature that includes uint64_t
produces a different mangled name on macOS and Linux. In such cases, if
analysis records are generated for multiple platforms, the final source records will
contain the symbols from all the platforms merged into a comma-separated list. The
scripts/merge-analyses.py
script is used to do this.
Also, in many cases (notably for C++ and JS code), local variables have their
target records omitted, and their source records have the no_crossref
property. Since local variables are not referenced outside of a very narrow
context, this optimization helps to avoid bloating the index files with a
lot of unnecessary entries.
A source record additionally contains a syntax
property, a pretty
property,
an optional no_crossref
property, and an optional nestingRange
property.
The syntax
property describes how the identifier should be syntax highlighted.
It is a comma-delimited list of strings. Currently the only strings that have any
effect are:
def
,decl
, andidl
all cause the identifier to be shown in bold.type
causes the identifier to be shown in a different color.
The pretty
property is used to generate the context menu items for
the identifier. It should contain a human-readable description like
constructor nsDocShell::nsDocShell
or property SessionStore.getTabState
.
The no_crossref
property, if set, always has a value of 1
, and indicates
that this identifier will have no target records and does not participate
in cross-referencing.
The nestingRange
property, if present, contains a value analogous to a
Clang SourceRange, with a string representation of "line1:col1-line2:col2" where
lines are 1-based and columns are 0-based. This currently powers
"position:sticky" source code display so that when you are inside hierarchically
nested definitions you can immediately understand where you are and what the
scope is without needing to manually scroll up.
Ideally, the nesting range's two points are the start of the token creating a nested block and the start of the token ending a nesting block. ("{" and "}" in C++ and the 'b' in "begin" for Pascal. This is consistent with how Clang's AST representations.) However, when an analyzer doesn't have an exact AST to work with (ex: rust as of writing this), we may do our best to simply specify a conservative range of lines covering the children of a definition.
In cases where we have accurate nestingRange information, we may be able to do neat tricks like highlight the area between braces or implement code folding.
These fields, like the "structured" record type, are in flux as part of work on the fancy branch.
A type
may be emitted which is a string representation of the compiler's
understanding of the type/return type of something. This will include
qualifiers like const/it's a pointer/it's a reference.
A typesym
symbol may be emitted when the type corresponds to a type indexed by
searchfox (which will inherently not include qualifiers unless we're talking
about method signatures).
Target records additionally contain a kind
property, a pretty
property,
and optionally context
, contextsym
, and peekRange
properties.
The kind
property should be one of use
, def
, decl
, assign
,
or idl
. This property determines whether the identifier will appear
under the "Uses", "Definitions", "Declarations", "Assignments", or
"IDL" category of the search results page.
The pretty
property is also used for the context menu. If a target
record is the only def
target for a given symbol, then the context
menu for any source records with that symbol will contain a Go to ${pretty}
entry, where ${pretty}
is the target's pretty
property.
The context
and contextsym
properties capture an enclosing context
for the identifier, such as the enclosing function. These are used to
link to the context in search results that include the target record.
The peekRange
property is a range of lines that appears to be
currently unused.
Structured records are an attempt to provide richer information about types and
their relationships. This is an evolving area of Searchfox, and is subject to
change. This expected change also informs the design of the record format,
which is just a wrapper around an opaque JSON structure as far as analysis.rs
is concerned. (All other record types are flat with a well known set of
keys/values.)
We emit structured records at the point of their definition. Structured records reference other structured records by their searchfox symbol identifier. A structured record itself won't embed child types (even if they're not visible outside the type) but instead reference them (which may involve searchfox generating identifiers that have no meaning outside of searchfox, like is done for locals).
Because of the realities of compilation, we know a parent class won't necessarily know all of its subclasses, so all type references in emitted records will generally only be upwards/sideways, never downwards. We depend on cross-referencing for determining sets of children/etc.
So for a class, we might expect the structured record in the analysis file to contain:
- A list of its known super-classes.
- A list of its fields and members.
But it would not contain:
- A list of its known sub-classes. This will be determined by cross-referencing.
Clang defines a "Character Units" type
CharUnits
that basically means bytes. Searchfox just calls them bytes and assumes an
8-bit byte because strings are such a large part of the Firefox codebase that
calling bytes "char units" just adds terrifying confusion.
Raw record info. These are attributes that will be found in the analysis files.
pretty
: The pretty name/identifier for this structured symbol info.sym
: The searchfox symbol for this symbol.kind
: A string with one of the following values:file
: This is a synthetic symbol for the file (FILE_normalized_blah
) for purposes like diagramming where we always operate in symbol space. Keep in mind that we also store per-file info inconcise-per-file-info.json
and which corresponds toConcisePerFileInfo
structures. That representation is intended for file-centric queries, but we can absolutely mirror data from that rep into this rep when needed/appropriate.enum
: XPIDL enums do this at leastclass
struct
union
method
: It's a method on a class/struct. It will haveoverrides
.function
: A boring function. Nooverrides
.field
: A member of a class/struct. Right now the field record has minimal info with the intent being that the data canonically lives on the parent symbol and this just provides theparentsym
necessary to get to that info. But this potentially needs more thought. TODO: Think more on this!ipc
: An IPC function where there's a send method and a recv method.namespace
: We currently don't emit this for structured records from the C++ indexer, but we probably want to start to; currently adding this for tree-sitter based tokenization where it's potentially useful to be able to refer to namespaces.
parentsym
: For methods and fields, the symbol of the record to which they belong. The current intent is that this is not populated for namespace purposes. (That is, for a class "Bar" in namespace "foo" with pretty name "foo::Bar", "Bar" would not have a parentsym.) The rationale is that we expect there to be a ton of stuff in any given namespace and we already have means of looking up the contents of a namespace via theidentifiers
table. This may want to evolve in the future, however. Note that this attribute is not currently used for any cross-referencing, it's just meta. (And note that target records'contextsym
should frequently be the same when it's not just a namespace.)slotOwner
: For bindings, an optionalStructuredBindingSlotInfo
as found inbindingSlots
but in the opposite direction and identifying the owning symbol.implKind
: Assume to be "impl" if not present. Reasonable values:idl
: This is the semantic definition in XPIDL, IPDL, WebIDL, etc.binding
: Ex: WebIDL binding glue. Auto-generated and perhaps of interest but of less interst than theidl
or theimpl
. Note that in cases like XPIDL, the binding machinery may be invisible for searchfox's context menus and the like.impl
: By default, most things will be "impl". But when WebIDL/etc. are involved this will be the actual implementation.
sizeBytes
: Size in bytes. Not present for method/function.bindingSlots
: For binding definitions, an array ofStructuredBindingSlotInfo
:slotKind
: SeeBindingSlotKind
slotLang
: SeeBindingSlotLang
ownerLang
: SeeBindingOwnerLang
sym
ontologySlots
: For semantics introduced viaontology-mapping.toml
, an array ofOntologySlotInfo
:slotKind
: SeeOntologySlotKind
syms
: An array of target symbols. Because of cases like RunnableConstructor this needs to be a set and not just a singular sym.
supers
: For class-like symbols, an array of:sym
: The searchfox symbol for this super.props
: An array of strings whose presence indicates a semantic attribute:virtual
: It's a virtual base class if present.
methods
: For class-like symbols, an array of:pretty
: The pretty name/identifier for this method.sym
: The searchfox symbol for this method.props
: An array of strings whose presence indicates a semantic attribute:static
: It's a static method (implies not "instance").instance
: It's a method on the instance (implies not "static").virtual
: It's a virtual method.user
: It's user-provided.defaulted
: It's defaulted per C++0x, AKA someone did= default
.deleted
: It's deleted per C++0x, AKA someone did= delete
.constexpr
: It's marked (C++11) constexpr!
fields
: For data-structure-like symbols, an array of:pretty
: The pretty name/identifier for this field.sym
: The searchfox symbol for this field.type
: Compiler's string representation of the type. This is still somewhat experimental and the same thing we emit for source records.typesym
: The searchfox symbol for the type of this field.offsetBytes
: Byte offset of the field within this immediate structure.bitPositions
: Only present in bit-fields. Object with the following properties whose names are derived from the AST dumper:begin
width
sizeBytes
: Only present in non-bit-fields. The size of the fieldin bytes.
overrides
: For methods, an array of method signatures that are overridden.sym
: The searchfox symbol for the referenced method.
props
: For methods, an array of strings whose presence indicates a semantic attribute. These are the same as the props under a class-like symbol'smethods
array.static
: It's a static method (implies not "instance").instance
: It's a method on the instance (implies not "static").virtual
: It's a virtual method.user
: It's user-provided.defaulted
: It's defaulted per C++0x, AKA someone did= default
.deleted
: It's deleted per C++0x, AKA someone did= delete
.constexpr
: It's marked (C++11) constexpr!
Attributes added/updated by cross-referencing:
subclasses
: Derived fromsupers
.pretty
sym
overriddenBy
: Derived fromoverrides
.pretty
sym
Attributes optionally added by merging (see more on this below). These will only be present when the structured records differed between platforms. If every platform had the same structured record contents, that representation is left as-is.
variants
: A list of structured record objects with the above (pre-cross-referencing) attributes. Each of these records will also have aplatforms
Array of string platform names that had these attributes.platforms
: Array of string platform names whose structured records were the same and chosen to be the canonical variant.
Note that there is also documentation in merge-analyses.rs
alongside the code
implementing this logic which may be more straightforward to understand.
Merging is performed by:
- Hashing all of the structured records for a given symbol so that we can detect equivalent structured records. (The records don't include the platform name at the time.)
- Checking if all the structured records were the same, and if so, just spitting out the singleton record as-is.
- If the records differed, which will frequently be the case at the time of
having written this where we built 32-bit ARM builds in addition to the 64-bit
builds for Windows/OS X/Linux, we arbitrarily pick a record to be the
"canonical" structured record "variant".
- Currently this is the last record we saw because this will never be the 32-bit ARM record with our current config where arm gets listed first.
- The canonical variant ends up looking exactly like it would have without
merging except we add a
variants
attribute which is a list of all of the other records we saw (consolidated by hashing). Every variant (including the top-level canonical variant) gets aplatforms
attribute that is just an Array of Strings that are the platform names as used by searchfox.
C++ inheritance is one of the most tricky issues to deal with in an analysis like this. When the user clicks on a method call and searches, should they find method definitions/calls for other classes? If so, how should this work? Mozsearch is pretty naive here, but it generally is good enough to find an over-approximation of the desired results.
Mangled C++ names include the concrete class on which a method is defined. So each symbol is automatically tagged with the name of the class in which it's defined.
First let's consider method calls. Consider a method call
obj->f
. The mangled name of f
will include the class name of obj
(or, when it doesn't implement f
itself, the first type above it in
the inheritance chain that does define f
). In this case, mozsearch
generates one source and one target record. The symbol in both records
is for f
's mangled name.
When a method T::f
is defined, mozsearch finds all the methods that
f
overrides in T
's direct and indirect superclasses (including via
multiple inheritance). It generates a single source record whose sym
property contains all of these symbols in a comma-delimited
list. Consequently, searching from this method definition will find
all calls to this implementation of f
, either directly through T
or via supertypes of T
(which may invoke this f
through dynamic
dispatch).
We also generate one target record for each override of f
in T
's
superclasses. This way, searching from any method call, either through
T
or some superclass of T
that might dispatch to this f
via
dynamic dispatch, will find this f
.
One flaw in this system is that it doesn't consider the full shape of the inheritance tree. Consider this case:
class A {
virtual void f() { return 1; }
};
class B : public A {};
class C : public A {
virtual void f() override { return 2; } // Y
};
B* b;
b->f(); // Z
We will generate a source record at the line marked Z with
one symbol, for A::f
. At line Y, we generate a target record for
C::f
that has symbols for C::f
and A::f
. Consequently, searching
at line Z will find C::f
. However, it's not actually possible for
b->f()
to call C::f
. A better technique would recognize this. That
is future work.
The clang plugin for indexing C++ files will typically analyze a given header file many times. In some cases, the analysis records generated for the header file will differ from one time to another. This happens most often in the case of macros and templates. Different files that include a header will instantiate a macro or template in different ways, which may produce different analysis results.
To compensate for this issue, the clang plugin checks to see if an analysis file already exists before writing one out. If one does exist, it reads it in and merges its data with the new data. The merge happens by combining the new and old records in a single vector, sorting the vector (using an arbitrary sort order), and removing duplicates. This new list of records is then written to disk. During this time, the file is kept locked to avoid issues with parallel compilation.