Discussion:
[PATCH] D39050: Add index-while-building support to Clang
(too old to reply)
Nathan Hawes via Phabricator via cfe-commits
2017-10-18 13:09:47 UTC
Permalink
nathawes created this revision.
Herald added a subscriber: mgorny.

Adds a new -index-store-path option that causes Clang to additionally collect and output source code indexing information to the supplied path.

This is done by wrapping the FrontendAction otherwise setup by the invocation with a WrappingIndexRecordAction. This action simply delegates to the wrapped action, but additionally multiplexes in its own IndexASTConsumer to collect symbol information from the AST and tracks the source file and module dependencies of the translation unit (via the IndexDependencyProvider class). When the action completes, it then writes this information out to the supplied index store path in the form of a unit file, which stores the dependency information of the translation unit, and record files, that store the symbol and symbol occurrences seen in each source file. These are written out in the LLVM Bitstream format.

This patch also adds a new IndexStore library under tools/ that abstracts away the structure and format of the index store so that clients (e.g. an IDE indexer service) can easily read the data it contains. It also notifies clients whenever data in the store is updated.

For a better (and more detailed) description of these changes, see the design document at: https://docs.google.com/document/d/1cH2sTpgSnJZCkZtJl1aY-rzy4uGPcrI-6RrUpdATO2Q/edit?usp=sharing

and the mailing list discussion 'RFC: Adding index-while-building support to Clang'


https://reviews.llvm.org/D39050

Files:
include/clang/Basic/DiagnosticFrontendKinds.td
include/clang/Basic/DiagnosticGroups.td
include/clang/DirectoryWatcher/DirectoryWatcher.h
include/clang/Driver/Job.h
include/clang/Driver/Options.td
include/clang/Frontend/CompilerInstance.h
include/clang/Frontend/FrontendOptions.h
include/clang/Index/IndexDataStore.h
include/clang/Index/IndexDataStoreSymbolUtils.h
include/clang/Index/IndexRecordReader.h
include/clang/Index/IndexRecordWriter.h
include/clang/Index/IndexUnitReader.h
include/clang/Index/IndexUnitWriter.h
include/clang/Index/IndexingAction.h
include/indexstore/IndexStoreCXX.h
include/indexstore/indexstore.h
lib/CMakeLists.txt
lib/DirectoryWatcher/CMakeLists.txt
lib/DirectoryWatcher/DirectoryWatcher.cpp
lib/Driver/Driver.cpp
lib/Driver/Job.cpp
lib/Driver/ToolChains/Clang.cpp
lib/Driver/ToolChains/Darwin.cpp
lib/Frontend/CompilerInstance.cpp
lib/Frontend/CompilerInvocation.cpp
lib/FrontendTool/CMakeLists.txt
lib/FrontendTool/ExecuteCompilerInvocation.cpp
lib/Index/BitstreamVisitor.h
lib/Index/CMakeLists.txt
lib/Index/ClangIndexRecordWriter.cpp
lib/Index/ClangIndexRecordWriter.h
lib/Index/FileIndexRecord.cpp
lib/Index/FileIndexRecord.h
lib/Index/IndexDataStore.cpp
lib/Index/IndexDataStoreUtils.cpp
lib/Index/IndexDataStoreUtils.h
lib/Index/IndexRecordHasher.cpp
lib/Index/IndexRecordHasher.h
lib/Index/IndexRecordReader.cpp
lib/Index/IndexRecordWriter.cpp
lib/Index/IndexUnitReader.cpp
lib/Index/IndexUnitWriter.cpp
lib/Index/IndexingAction.cpp
lib/Index/IndexingContext.cpp
lib/Index/IndexingContext.h
test/Index/Store/Inputs/head.h
test/Index/Store/Inputs/json.c.json
test/Index/Store/Inputs/module/ModDep.h
test/Index/Store/Inputs/module/ModSystem.h
test/Index/Store/Inputs/module/ModTop.h
test/Index/Store/Inputs/module/ModTopSub1.h
test/Index/Store/Inputs/module/ModTopSub2.h
test/Index/Store/Inputs/module/module.modulemap
test/Index/Store/Inputs/overlay.yaml
test/Index/Store/Inputs/print-unit.h
test/Index/Store/Inputs/sys/another.h
test/Index/Store/Inputs/sys/syshead.h
test/Index/Store/Inputs/test1.c
test/Index/Store/Inputs/test2.c
test/Index/Store/Inputs/test3.cpp
test/Index/Store/Inputs/using-overlay.h
test/Index/Store/assembly-invocation.c
test/Index/Store/external-source-symbol-hash.m
test/Index/Store/handle-prebuilt-module.m
test/Index/Store/json-with-module.m
test/Index/Store/json-with-module.m.json
test/Index/Store/json-with-pch.c
test/Index/Store/json-with-pch.c.json
test/Index/Store/json.c
test/Index/Store/print-record.mm
test/Index/Store/print-unit.c
test/Index/Store/print-units-with-modules.m
test/Index/Store/print-units-with-pch.c
test/Index/Store/record-hash-crash-invalid-name.cpp
test/Index/Store/record-hash-crash.cpp
test/Index/Store/record-hash-using.cpp
test/Index/Store/record-hash.cpp
test/Index/Store/relative-out-path.c
test/Index/Store/syntax-only.c
test/Index/Store/unit-with-vfs.c
test/Index/Store/unit-workdir-prefix.c
test/Index/Store/using-libstdcpp-arc.mm
tools/CMakeLists.txt
tools/IndexStore/CMakeLists.txt
tools/IndexStore/IndexStore.cpp
tools/IndexStore/IndexStore.exports
tools/c-index-test/CMakeLists.txt
tools/c-index-test/JSONAggregation.cpp
tools/c-index-test/JSONAggregation.h
tools/c-index-test/core_main.cpp
Alex Lorenz via Phabricator via cfe-commits
2017-10-18 13:59:35 UTC
Permalink
arphaman added a comment.

I think this patch should be split into a number of smaller patches to help the review process.

Things like `tools/IndexStore`, `DirectoryWatcher` and other components that are not directly needed right now should definitely be in their own patches.
It would be nice to find some way to split the implementation into multiple patches as well.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-10-31 22:11:22 UTC
Permalink
nathawes updated this revision to Diff 120916.
nathawes edited the summary of this revision.
nathawes added a comment.

I've split out the parts for reading and managing the store data (the IndexStore and DirectoryWatcher libraries) into a separate patch, so this one is now just about the -index-store-path option and everything necessary to write out the index data to the provided path.


https://reviews.llvm.org/D39050

Files:
include/clang/Basic/DiagnosticFrontendKinds.td
include/clang/Basic/DiagnosticGroups.td
include/clang/Driver/Job.h
include/clang/Driver/Options.td
include/clang/Frontend/CompilerInstance.h
include/clang/Frontend/FrontendOptions.h
include/clang/Index/IndexDataStoreSymbolUtils.h
include/clang/Index/IndexRecordWriter.h
include/clang/Index/IndexUnitWriter.h
include/clang/Index/IndexingAction.h
include/indexstore/indexstore.h
lib/Driver/Driver.cpp
lib/Driver/Job.cpp
lib/Driver/ToolChains/Clang.cpp
lib/Driver/ToolChains/Darwin.cpp
lib/Frontend/CompilerInstance.cpp
lib/Frontend/CompilerInvocation.cpp
lib/FrontendTool/CMakeLists.txt
lib/FrontendTool/ExecuteCompilerInvocation.cpp
lib/Index/BitstreamVisitor.h
lib/Index/CMakeLists.txt
lib/Index/ClangIndexRecordWriter.cpp
lib/Index/ClangIndexRecordWriter.h
lib/Index/FileIndexRecord.cpp
lib/Index/FileIndexRecord.h
lib/Index/IndexDataStore.cpp
lib/Index/IndexDataStoreUtils.cpp
lib/Index/IndexDataStoreUtils.h
lib/Index/IndexRecordHasher.cpp
lib/Index/IndexRecordHasher.h
lib/Index/IndexRecordWriter.cpp
lib/Index/IndexUnitWriter.cpp
lib/Index/IndexingAction.cpp
lib/Index/IndexingContext.cpp
lib/Index/IndexingContext.h
test/Index/Store/assembly-invocation.c
test/Index/Store/record-hash-using.cpp
test/Index/Store/record-hash.cpp
test/Index/Store/relative-out-path.c
Alex Lorenz via Phabricator via cfe-commits
2017-10-31 23:08:00 UTC
Permalink
arphaman added inline comments.


================
Comment at: include/clang/Index/IndexDataStoreSymbolUtils.h:13
+
+#include "indexstore/indexstore.h"
+#include "clang/Index/IndexSymbol.h"
----------------
It looks to me like this header, `"indexstore/indexstore.h"`, and `IndexDataStoreUtils.cpp` are utilities just for the C API, so could we take it out of here as well?


================
Comment at: lib/Driver/ToolChains/Clang.cpp:3597
+
+ if (const char *IdxStorePath = ::getenv("CLANG_PROJECT_INDEX_PATH")) {
+ CmdArgs.push_back("-index-store-path");
----------------
What is this environment variable used for? And why does it imply the other two flags?


================
Comment at: lib/Frontend/CompilerInstance.cpp:1153
+ // Pass along the GenModuleActionWrapper callback
+ auto wrapGenModuleAction = ImportingInstance.getGenModuleActionWrapper();
+ Instance.setGenModuleActionWrapper(wrapGenModuleAction);
----------------
Please start your variable names with uppercase (http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly).


================
Comment at: lib/Index/IndexRecordHasher.cpp:103
+ COMBINE_HASH(Hasher.hash(param->getType()));
+ }
+ return Hash;
----------------
Should you hash the return type as well?


================
Comment at: lib/Index/IndexRecordHasher.cpp:137
+ Loc = SM.getFileLoc(Loc);
+ const std::pair<FileID, unsigned> &Decomposed = SM.getDecomposedLoc(Loc);
+ const FileEntry *FE = SM.getFileEntryForID(Decomposed.first);
----------------
No need to use a reference type here.


================
Comment at: lib/Index/IndexRecordHasher.cpp:146
+ if (IncludeOffset) {
+ // Use the offest into the FileID to represent the location. Using
+ // a line/column can cause us to look back at the original source file,
----------------
offset


================
Comment at: lib/Index/IndexRecordHasher.cpp:204
+ if (Q.hasConst())
+ qVal |= 0x1;
+ if (Q.hasVolatile())
----------------
You can use `Qualifiers::Const` here or make your own enum instead of raw constants.


================
Comment at: lib/Index/IndexRecordHasher.cpp:255
+ continue;
+ }
+
----------------
There are other types in Clang which aren't hashed it seems. Can you add a general FIXME for them here?


================
Comment at: lib/Index/IndexRecordWriter.cpp:278
+ // Write the record header.
+ auto *State = new RecordState(RecordPath.str());
+ Record = State;
----------------
It might be better to forward declare `RecordState` in the header, use `unique_ptr<RecordState>` field instead of `void *` in the class and use `llvm::make_unique` here. Then you can remove the `ScopedDelete` RAII struct in the next function.


================
Comment at: lib/Index/IndexUnitWriter.cpp:30
+
+class IndexUnitWriter::PathStorage {
+ std::string WorkDir;
----------------
What is this class responsible for?


================
Comment at: lib/Index/IndexingAction.cpp:686
+ if (RecordWriter.writeRecord(FE->getName(), Rec, Error, &RecordFile)) {
+ unsigned DiagID = Diag.getCustomDiagID(DiagnosticsEngine::Error,
+ "failed writing record '%0': %1");
----------------
We might want to start using a new diagnostic group for index-while-building errors instead of the custom ones.


================
Comment at: lib/Index/IndexingContext.cpp:162
+ // treated as system one.
+ if (path == "/")
+ path = StringRef();
----------------
It might be worth investigating if you can use any of the LLVM's path APIs here instead of doing a UNIX-specific check.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-10-31 23:45:56 UTC
Permalink
nathawes planned changes to this revision.
nathawes added a comment.

Thanks @arphaman! I'll work through your comments and update.



================
Comment at: include/clang/Index/IndexDataStoreSymbolUtils.h:13
+
+#include "indexstore/indexstore.h"
+#include "clang/Index/IndexSymbol.h"
----------------
Post by Alex Lorenz via Phabricator via cfe-commits
It looks to me like this header, `"indexstore/indexstore.h"`, and `IndexDataStoreUtils.cpp` are utilities just for the C API, so could we take it out of here as well?
They're used by IndexRecordWriter below to convert from libIndex's representation of things to the index store's.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-11-07 02:49:07 UTC
Permalink
nathawes added inline comments.


================
Comment at: lib/Index/IndexRecordHasher.cpp:103
+ COMBINE_HASH(Hasher.hash(param->getType()));
+ }
+ return Hash;
----------------
Post by Alex Lorenz via Phabricator via cfe-commits
Should you hash the return type as well?
The return type doesn't affect the function's USR, so there's no need to consider it when hashing the function decl. The hashing here is happening per decl occurrence (source offset + role set + Decl) collected by the index AST walker, so changing the return type will still change the record hash when any decl occurrences it contains are hashed in.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-11-07 03:24:36 UTC
Permalink
nathawes updated this revision to Diff 121833.
nathawes added a comment.

Based on @arphaman's feedback:

- Pulled the index store related diagnostics out into their own category/diagnostic group
- Removed the CLANG_PROJECT_INDEX_PATH env var check.
- Swapped "/" used in a few places as a separator/root with the equivalent llvm::sys::path call.
- Fixed the typo/convention/documentation issues and simplifications pointed out so far


https://reviews.llvm.org/D39050

Files:
include/clang/Basic/AllDiagnostics.h
include/clang/Basic/CMakeLists.txt
include/clang/Basic/Diagnostic.td
include/clang/Basic/DiagnosticGroups.td
include/clang/Basic/DiagnosticIDs.h
include/clang/Basic/DiagnosticIndexKinds.td
include/clang/Driver/Job.h
include/clang/Driver/Options.td
include/clang/Frontend/CompilerInstance.h
include/clang/Frontend/FrontendOptions.h
include/clang/Index/IndexDataStoreSymbolUtils.h
include/clang/Index/IndexDiagnostic.h
include/clang/Index/IndexRecordWriter.h
include/clang/Index/IndexUnitWriter.h
include/clang/Index/IndexingAction.h
include/clang/module.modulemap
include/indexstore/indexstore.h
lib/Basic/DiagnosticIDs.cpp
lib/Driver/Driver.cpp
lib/Driver/Job.cpp
lib/Driver/ToolChains/Clang.cpp
lib/Driver/ToolChains/Darwin.cpp
lib/Frontend/CompilerInstance.cpp
lib/Frontend/CompilerInvocation.cpp
lib/FrontendTool/CMakeLists.txt
lib/FrontendTool/ExecuteCompilerInvocation.cpp
lib/Index/BitstreamVisitor.h
lib/Index/CMakeLists.txt
lib/Index/ClangIndexRecordWriter.cpp
lib/Index/ClangIndexRecordWriter.h
lib/Index/FileIndexRecord.cpp
lib/Index/FileIndexRecord.h
lib/Index/IndexDataStore.cpp
lib/Index/IndexDataStoreUtils.cpp
lib/Index/IndexDataStoreUtils.h
lib/Index/IndexRecordHasher.cpp
lib/Index/IndexRecordHasher.h
lib/Index/IndexRecordWriter.cpp
lib/Index/IndexUnitWriter.cpp
lib/Index/IndexingAction.cpp
lib/Index/IndexingContext.cpp
lib/Index/IndexingContext.h
test/Index/Store/assembly-invocation.c
test/Index/Store/record-hash-using.cpp
test/Index/Store/record-hash.cpp
test/Index/Store/relative-out-path.c
tools/diagtool/DiagnosticNames.cpp
Marc-Andre Laperle via Phabricator via cfe-commits
2017-11-08 03:19:48 UTC
Permalink
malaperle added inline comments.


================
Comment at: lib/Index/IndexUnitWriter.cpp:212
+ return true;
+};
+
----------------
extra semi-colon (noticed this warning while compiling)


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2017-11-08 04:58:50 UTC
Permalink
malaperle added inline comments.


================
Comment at: lib/Index/IndexingAction.cpp:562
+
+ SourceManager &SM = CI.getSourceManager();
+ DiagnosticsEngine &Diag = CI.getDiagnostics();
----------------
As a first attempt, I tried to use index::createIndexDataRecordingAction in combination with ASTUnit::LoadFromCompilerInvocationAction but one problem is that right before it calls EndSourceFileAction in LoadFromCompilerInvocationAction, it calls transferASTDataFromCompilerInstance which means that the SourceManager in CompilerInstance is nulled out as it gets "transfered" to the AST. So this line crashes in this case. To be fair, at this point I don't need the ASTUnit so I can look at executing the action differently, but I thought I'd point it out!


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2017-11-08 16:19:02 UTC
Permalink
malaperle added inline comments.


================
Comment at: lib/Index/IndexRecordWriter.cpp:155
+ if (!IsNew) {
+ llvm::errs() << "Index: Duplicate USR! " << SymInfo.USR << "\n";
+ // FIXME: print more information so it's easier to find the declaration.
----------------
I'm getting quite a bit of those while indexing Clangd, it looks like it comes from some LLVM/Support headers:
Index: Duplicate USR! c:@***@std@ST>2#NI#***@__try_lock_impl
Index: Duplicate USR! c:@***@llvm@ST>1#***@DenseMapInfo
Index: Duplicate USR! c:@***@llvm@ST>1#***@isPodLike
Index: Duplicate USR! c:@***@llvm@***@detail@ST>1#***@unit
Index: Duplicate USR! c:@***@llvm@ST>2#T#***@format_provider
Index: Duplicate USR! c:@***@llvm@ST>2#T#***@format_provider
Index: Duplicate USR! c:@***@llvm@ST>1#***@PointerLikeTypeTraits
Index: Duplicate USR! c:@***@llvm@ST>1#***@simplify_type
Index: Duplicate USR! c:@***@std@ST>1#***@atomic
Index: Duplicate USR! c:@***@llvm@ST>1#***@isPodLike
Index: Duplicate USR! c:@***@llvm@ST>1#***@DenseMapInfo

I think it would be good to have the file name at least in the log. I also assume those duplication are issues that would have to be fixed in USRGenerator (i.e. in separate patches) ?


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2017-11-08 21:40:18 UTC
Permalink
malaperle added inline comments.


================
Comment at: include/indexstore/IndexStoreCXX.h:75
+ bool foreachRelation(llvm::function_ref<bool(IndexSymbolRelation)> receiver) {
+#if INDEXSTORE_HAS_BLOCKS
+ return indexstore_occurrence_relations_apply(obj, ^bool(indexstore_symbol_relation_t sym_rel) {
----------------
I know this is an old revision but I thought I should ask for the future patch... Would it be possible to not use "blocks"? This will affect portability of the code. I'm not familiar with blocks but I would think it would be possible to replace with C++11 lambdas, or something else that's standard. I was just playing around with this code and since I used GCC, it did not work.


https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2017-11-09 14:54:32 UTC
Permalink
ioeric added a comment.
Post by Alex Lorenz via Phabricator via cfe-commits
I think this patch should be split into a number of smaller patches to help the review process.
Things like `tools/IndexStore`, `DirectoryWatcher` and other components that are not directly needed right now should definitely be in their own patches.
It would be nice to find some way to split the implementation into multiple patches as well.
+1.

This is a lot of work (but great work!) for one patch. Smaller/incremental patches help reviewers understand and (hopefully) capture potential improvement of the design. I would really appreciate it if you could further split the patch.

Some comments/ideas:

- The lack of tests is a bit concerning.
- I think the implementation of the index output logic (e.g. `IndexUnitWriter` and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.
- I would suggest that you start with a patch that implement the index action and just enough components so that you could test the action.

Thanks!


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-11-09 22:46:43 UTC
Permalink
nathawes planned changes to this revision.
nathawes added a comment.
Post by Eric Liu via Phabricator via cfe-commits
Post by Alex Lorenz via Phabricator via cfe-commits
I think this patch should be split into a number of smaller patches to help the review process.
Things like `tools/IndexStore`, `DirectoryWatcher` and other components that are not directly needed right now should definitely be in their own patches.
It would be nice to find some way to split the implementation into multiple patches as well.
+1.
This is a lot of work (but great work!) for one patch. Smaller/incremental patches help reviewers understand and (hopefully) capture potential improvement of the design. I would really appreciate it if you could further split the patch.
- The lack of tests is a bit concerning.
I moved all the code for reading the index store data into a separate patch (to come after this one) in order to slim this one down for review, and most of the tests went with it because they're based around reading and dumping the stored data for FileCheck. The original version of this patch has them all (https://reviews.llvm.org/D39050?id=118854). The ones that remain here are just those checking that the unit/record files are written out and that the hashing mechanism is producing distinct record files when the symbolic content of the source file changes.
Post by Eric Liu via Phabricator via cfe-commits
- I think the implementation of the index output logic (e.g. `IndexUnitWriter` and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.
The added IndexRecordAction and existing IndexAction use the same functionality from libIndex to collect the indexing data, so I'm not sure mocking the unit writer to unit test IndexRecordAction would add very much value – writing the index data out is the new behavior. The existing tests for IndexAction (under test/Index/Core) are already covering the correctness of the majority of the collected indexing info and the tests coming in the follow-up patch (seen in the original version of this patch) test it's still correct after the write/read round trip.
Post by Eric Liu via Phabricator via cfe-commits
- I would suggest that you start with a patch that implement the index action and just enough components so that you could test the action.
Thanks!
https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2017-11-10 13:58:44 UTC
Permalink
ioeric added a comment.
Post by Nathan Hawes via Phabricator via cfe-commits
Post by Eric Liu via Phabricator via cfe-commits
- I think the implementation of the index output logic (e.g. `IndexUnitWriter` and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.
The added IndexRecordAction and existing IndexAction use the same functionality from libIndex to collect the indexing data, so I'm not sure mocking the unit writer to unit test IndexRecordAction would add very much value – writing the index data out is the new behavior. The existing tests for IndexAction (under test/Index/Core) are already covering the correctness of the majority of the collected indexing info and the tests coming in the follow-up patch (seen in the original version of this patch) test it's still correct after the write/read round trip.
Thanks for the clarification! I still think it's useful to decouple the IndexAction from the bit format file so that it could be reusable elsewhere. For example, I can see the index action be useful to clangd for building in-memory index.

I also tried applying your original patch locally but couldn't get it to work mostly due to portability issues (e.g. `blocks` and `if (APPLE)` in make files). AFAIK, many folks compile clang with GCC and/or without APPLE, so it's important that you get the portability right from the very beginning. Thanks!

Index-while-build is awesome! I'm looking forward to your patches!


https://reviews.llvm.org/D39050
Argyrios Kyrtzidis via Phabricator via cfe-commits
2017-11-11 01:46:24 UTC
Permalink
akyrtzi added a comment.

Hey Eric,
Post by Eric Liu via Phabricator via cfe-commits
Post by Nathan Hawes via Phabricator via cfe-commits
Post by Eric Liu via Phabricator via cfe-commits
- I think the implementation of the index output logic (e.g. `IndexUnitWriter` and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.
The added IndexRecordAction and existing IndexAction use the same functionality from libIndex to collect the indexing data, so I'm not sure mocking the unit writer to unit test IndexRecordAction would add very much value – writing the index data out is the new behavior. The existing tests for IndexAction (under test/Index/Core) are already covering the correctness of the majority of the collected indexing info and the tests coming in the follow-up patch (seen in the original version of this patch) test it's still correct after the write/read round trip.
Thanks for the clarification! I still think it's useful to decouple the IndexAction from the bit format file so that it could be reusable elsewhere. For example, I can see the index action be useful to clangd for building in-memory index.
As Nathan mentioned, we believe the indexing action, as it exists in the trunk, is decoupled enough to be useful, for example Marc was already able to use it and write out the indexing data in a completely different format for his fork of clangd. Of course, we are definitely interested in any additional refactorings that would structure things better and we are eager to see and discuss follow-up patches from anyone that is interested in improving the code, but could we treat this as potential follow-up improvements ?

We are eager to provide the functionality so others can start experimenting with it; I'd propose that we discuss ideas for refactoring of the code as follow-up, what do you think ? Getting the initial functionality in and iterating on it, while getting more experience with applying it on various use-cases, is a common operating mindset of the llvm/clang projects.
Post by Eric Liu via Phabricator via cfe-commits
I also tried applying your original patch locally but couldn't get it to work mostly due to portability issues (e.g. `blocks` and `if (APPLE)` in make files). AFAIK, many folks compile clang with GCC and/or without APPLE, so it's important that you get the portability right from the very beginning. Thanks!
Nathan will look into making using blocks optional, providing additional function pointer+context APIs where appropriate and having the common implementation using lambdas.
For the APPLE specific parts it, the only specific darwin-specific part is the part using FSEvents, the other 'if (APPLE)' checks can likely be removed. We would generally need help from people with linux expertise to provide the 'FSEvents' equivalent functionality but this is a small part of the overall feature, it's not important for getting the index-while-building data.

But these things are not part of the current patch, we can discuss again with the follow-up patches that will contain those things.
Post by Eric Liu via Phabricator via cfe-commits
Index-while-build is awesome! I'm looking forward to your patches!
https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2017-11-16 21:21:43 UTC
Permalink
ioeric added a comment.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Hey Eric,
Post by Eric Liu via Phabricator via cfe-commits
Post by Nathan Hawes via Phabricator via cfe-commits
Post by Eric Liu via Phabricator via cfe-commits
- I think the implementation of the index output logic (e.g. `IndexUnitWriter` and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.
The added IndexRecordAction and existing IndexAction use the same functionality from libIndex to collect the indexing data, so I'm not sure mocking the unit writer to unit test IndexRecordAction would add very much value – writing the index data out is the new behavior. The existing tests for IndexAction (under test/Index/Core) are already covering the correctness of the majority of the collected indexing info and the tests coming in the follow-up patch (seen in the original version of this patch) test it's still correct after the write/read round trip.
Thanks for the clarification! I still think it's useful to decouple the IndexAction from the bit format file so that it could be reusable elsewhere. For example, I can see the index action be useful to clangd for building in-memory index.
As Nathan mentioned, we believe the indexing action, as it exists in the trunk, is decoupled enough to be useful, for example Marc was already able to use it and write out the indexing data in a completely different format for his fork of clangd. Of course, we are definitely interested in any additional refactorings that would structure things better and we are eager to see and discuss follow-up patches from anyone that is interested in improving the code, but could we treat this as potential follow-up improvements ?
We are eager to provide the functionality so others can start experimenting with it; I'd propose that we discuss ideas for refactoring of the code as follow-up, what do you think ? Getting the initial functionality in and iterating on it, while getting more experience with applying it on various use-cases, is a common operating mindset of the llvm/clang projects.
To be honest, I want this functionality to get in as much as you do, and I'm more than happy to prioritize the code review for it :) But the current patch size makes the reviewing really hard (e.g. I would never have caught the BLOCK issues hadn't I tried running the original patch myself). I'm not sure if it's really a common practice to check in a big chunk of code without careful code review and leave potential improvements as followups. I'm sure @klimek would have thoughts about this.

If the index action is already flexible enough, would you mind splitting the code for the index action out so that we can start reviewing it? Given that the current patch has very few tests, I guess it wouldn't be too much worse to split out the action without proper test.


https://reviews.llvm.org/D39050
Argyrios Kyrtzidis via Phabricator via cfe-commits
2017-11-16 23:35:32 UTC
Permalink
akyrtzi added a comment.
To be clear, I didn't mean to imply we don't want careful code review, we are really happy for people to point out issues. For example the building problems on linux are serious issues that we will fix and we are grateful for your feedback!
Post by Eric Liu via Phabricator via cfe-commits
If the index action is already flexible enough, would you mind splitting the code for the index action out so that we can start reviewing it? Given that the current patch has very few tests, I guess it wouldn't be too much worse to split out the action without proper test.
To clarify, the index action Nathan and I are referring to, is the indexing action that exists currently in trunk and is the source of the index symbols, feeding index symbols to an abstract `IndexDataConsumer`. See here: https://llvm.org/svn/llvm-project/cfe/trunk/include/clang/Index/IndexingAction.h
This is what Marc used to get the index symbols and store them in his own format. Tests for this functionality are in: https://llvm.org/svn/llvm-project/cfe/trunk/test/Index/Core/


https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2017-11-17 00:12:48 UTC
Permalink
ioeric added a comment.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Post by Eric Liu via Phabricator via cfe-commits
If the index action is already flexible enough, would you mind splitting the code for the index action out so that we can start reviewing it? Given that the current patch has very few tests, I guess it wouldn't be too much worse to split out the action without proper test.
To clarify, the index action Nathan and I are referring to, is the indexing action that exists currently in trunk and is the source of the index symbols, feeding index symbols to an abstract `IndexDataConsumer`. See here: https://llvm.org/svn/llvm-project/cfe/trunk/include/clang/Index/IndexingAction.h
This is what Marc used to get the index symbols and store them in his own format. Tests for this functionality are in: https://llvm.org/svn/llvm-project/cfe/trunk/test/Index/Core/
Ah, sorry, I was referring to `IndexRecordAction` and its friends (record readers/writers). I didn't notice the newly added index action added and really didn't mean to ask you to refactor the existing code. Apologies for the miscommunication!

What I wanted to proposed is that we could decouple reading/writing of record/unit from the bit file format, so that the record output is not tied to a single output format (e.g. bit format, directory-based) and thus make the compiler more flexible. This might already be the case, but it's not really easy to tell from the current patch...


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2017-11-20 03:56:49 UTC
Permalink
malaperle added a comment.

Hi! I got a bit further in my experiment in integrating this in Clangd. I put some comments (in the first more complete revision). But since the scope of this patch changed, if you feel like we should take the discussions elsewhere, please let me know! Thanks!



================
Comment at: include/indexstore/IndexStoreCXX.h:84
+
+ std::pair<unsigned, unsigned> getLineCol() {
+ unsigned line, col;
----------------
From what I understand, this returns the beginning of the occurrence. It would be useful to also have the end of the occurrence. From what I tested in Xcode, when you do "Find Selected Symbol in Workspace", it highlights the symbol name in yellow in the list of matches, so it mush use that LineCol then highlight the matching name. This is works in many situations but others occurrences won't have the name of the symbol. For example:
"MyClass o1, o2;"
If I use "Find Selected Symbol in Workspace" on MyClass constructor, if won't be able to highlight o1 and o2
Do you think it would be possible to add that (EndLineCol)? If not, how would one go about extending libindexstore in order to add additional information per occurrence? It is not obvious to me so far.
We also need other things, for example in the case of definitions, we need the full range of the definition so that users can "peek" at definitions in a pop-up. Without storing the end of the definition in the index, we would need to reparse the file.



================
Comment at: include/indexstore/IndexStoreCXX.h:374
+
+ enum class DependencyKind {
+ Unit,
----------------
As part of this dependency tracking mechanism, I haven't found that it could provide information about about the files including a specific header. So given a header (or source file in odd cases), I do not see a way to get all the files that would need to be reindexed if that header changed. Is that something you achieve outside the index? Or perhaps this is something I missed in the code.


================
Comment at: include/indexstore/IndexStoreCXX.h:377
+ Record,
+ File,
+ };
----------------
Could there be a bit of explanation about what's a File dependency versus record and unit? All units and records are file dependencies, right? Are there any files that are neither records or units?


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-11-27 22:41:00 UTC
Permalink
nathawes added a comment.

Thanks for the feedback @malaperle!



================
Comment at: include/indexstore/IndexStoreCXX.h:84
+
+ std::pair<unsigned, unsigned> getLineCol() {
+ unsigned line, col;
----------------
Post by Marc-Andre Laperle via Phabricator via cfe-commits
"MyClass o1, o2;"
If I use "Find Selected Symbol in Workspace" on MyClass constructor, if won't be able to highlight o1 and o2
Do you think it would be possible to add that (EndLineCol)? If not, how would one go about extending libindexstore in order to add additional information per occurrence? It is not obvious to me so far.
We also need other things, for example in the case of definitions, we need the full range of the definition so that users can "peek" at definitions in a pop-up. Without storing the end of the definition in the index, we would need to reparse the file.
Our approach to related locations (e.g. name, signature, body, and doc comment start/end locs) has been to not include them in the index and derive them from the start location later. There's less data to collect and write out during the build that way, and deriving the other locations isn't that costly usually, as in most cases you 1) don't need to type check or even preprocess to get the related locations, as with finding the end of o1 and o2 in your example, and 2) usually only need to derive locations for a single or small number of occurrences, like when 'peeking' at a definition.

Are there cases where you think this approach won't work/perform well enough for the indexer-backed queries clangd needs to support?



================
Comment at: include/indexstore/IndexStoreCXX.h:374
+
+ enum class DependencyKind {
+ Unit,
----------------
Post by Marc-Andre Laperle via Phabricator via cfe-commits
As part of this dependency tracking mechanism, I haven't found that it could provide information about about the files including a specific header. So given a header (or source file in odd cases), I do not see a way to get all the files that would need to be reindexed if that header changed. Is that something you achieve outside the index? Or perhaps this is something I missed in the code.
The unit files store the path of the header/source files they depend on as 'File' dependencies. So any unit file with 'File' dependency on header/source file that was modified may need to be re-indexed.

To support finding which specific files include or are included by a given header (rather than which units somehow transitively include it) we also store the file-to-file inclusions in the unit file (retrieved via IndexUnitReader's foreachInclude method below).


================
Comment at: include/indexstore/IndexStoreCXX.h:377
+ Record,
+ File,
+ };
----------------
Post by Marc-Andre Laperle via Phabricator via cfe-commits
Could there be a bit of explanation about what's a File dependency versus record and unit? All units and records are file dependencies, right? Are there any files that are neither records or units?
I'll rename this to SourceFile and add some comments to explain. Unit file dependencies separate the source dependencies into 'File' dependencies and 'Record' dependencies. The 'File' dependencies track the paths of the header/source files seen in the translation unit, while the 'Record' dependencies track which record files have the symbolic content seen in those source files – the header/source file path doesn't appear anywhere in the record file. This separation lets us depend on a single record file corresponding to multiple source files (e.g. when two source files have the same symbolic content), and on a single source file corresponding multiple record files (e.g. when a single header is included multiple times with different preprocessor contexts changing its symbolic content).


https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2017-11-28 23:16:48 UTC
Permalink
ioeric added a comment.

First round of comments. Mostly around indexing actions and file records; I haven't started reviewing the data writing and storing code. I think it might make sense to split the index writing and storing logics into a separate patch, which should be possible if `writeUnitData` is abstracted into an interface.



================
Comment at: include/clang/Frontend/CompilerInstance.h:188
+ /// produced to generate imported modules before they are executed.
+ std::function<std::unique_ptr<FrontendAction>
+ (const FrontendOptions &opts, std::unique_ptr<FrontendAction> action)>
----------------
It might make sense to define an alias for `std::function<std::unique_ptr<FrontendAction>(const FrontendOptions &opts, std::unique_ptr<FrontendAction> action)>`, which is used multiple times.


================
Comment at: include/clang/Frontend/FrontendOptions.h:262

+ std::string IndexStorePath;
+ unsigned IndexIgnoreSystemSymbols : 1;
----------------
It might make sense to also have documentations for these options here.


================
Comment at: include/indexstore/IndexStoreCXX.h:84
+
+ std::pair<unsigned, unsigned> getLineCol() {
+ unsigned line, col;
----------------
Post by Nathan Hawes via Phabricator via cfe-commits
Post by Marc-Andre Laperle via Phabricator via cfe-commits
"MyClass o1, o2;"
If I use "Find Selected Symbol in Workspace" on MyClass constructor, if won't be able to highlight o1 and o2
Do you think it would be possible to add that (EndLineCol)? If not, how would one go about extending libindexstore in order to add additional information per occurrence? It is not obvious to me so far.
We also need other things, for example in the case of definitions, we need the full range of the definition so that users can "peek" at definitions in a pop-up. Without storing the end of the definition in the index, we would need to reparse the file.
Our approach to related locations (e.g. name, signature, body, and doc comment start/end locs) has been to not include them in the index and derive them from the start location later. There's less data to collect and write out during the build that way, and deriving the other locations isn't that costly usually, as in most cases you 1) don't need to type check or even preprocess to get the related locations, as with finding the end of o1 and o2 in your example, and 2) usually only need to derive locations for a single or small number of occurrences, like when 'peeking' at a definition.
Are there cases where you think this approach won't work/perform well enough for the indexer-backed queries clangd needs to support?
I agree that we should try to keep the serialized symbol size minimal, but I think it's worthwhile to also store end locations for occurrences because 1) it's cheap, 2) it's not necessary easy to compute without parsing for occurrences like `a::b::c` or `a::b<X>`, 3) it would be useful for many LSP use cases.


================
Comment at: lib/Frontend/CompilerInstance.cpp:1176
+ new GenerateModuleFromModuleMapAction);
+ if (WrapGenModuleAction) {
+ Action = WrapGenModuleAction(FrontendOpts, std::move(Action));
----------------
nit: no braces around one liners.


================
Comment at: lib/FrontendTool/ExecuteCompilerInvocation.cpp:170
+ Act = index::createIndexDataRecordingAction(FEOpts, std::move(Act));
+ CI.setGenModuleActionWrapper(&index::createIndexDataRecordingAction);
+ }
----------------
Could you comment on what this does? The `Act` above is already wrapped. Why do we need `setGenModuleActionWrapper` to `createIndexDataRecordingAction` again? Also, `createIndexDataRecordingAction` doesn't seem related to `GenModule`.


================
Comment at: lib/Index/FileIndexRecord.cpp:23
+ ArrayRef<SymbolRelation> Relations) {
+ assert(D->isCanonicalDecl());
+
----------------
Why?


================
Comment at: lib/Index/FileIndexRecord.cpp:37
+
+ DeclOccurrence NewInfo(Roles, Offset, D, Relations);
+ auto It = std::upper_bound(Decls.begin(), Decls.end(), NewInfo);
----------------
Please comment when this would happen.


================
Comment at: lib/Index/FileIndexRecord.cpp:39
+ auto It = std::upper_bound(Decls.begin(), Decls.end(), NewInfo);
+ Decls.insert(It, std::move(NewInfo));
+}
----------------
Why do we need `Decls` to be sorted by offset? If we want this for printing, it might make sense to just do a sort there.


================
Comment at: lib/Index/FileIndexRecord.h:24
+
+class FileIndexRecord {
+public:
----------------
Please add documentation.


================
Comment at: lib/Index/FileIndexRecord.h:41
+
+ friend bool operator <(const DeclOccurrence &LHS,
+ const DeclOccurrence &RHS) {
----------------
Is this clang-formatted? You might want to run git-clang-format on the whole patch.


================
Comment at: lib/Index/IndexingAction.cpp:284
+ SourceManager &SourceMgr) :
+ IndexCtx(indexCtx), RecordOpts(recordOpts),
+ Includes(IncludesForFile), SourceMgr(SourceMgr) {}
----------------
Again, you don't need the full `IndexingContext` and `RecordOptions` here.


================
Comment at: lib/Index/IndexingAction.cpp:294
+
+ std::pair<FileID, unsigned> LocInfo =
+ SourceMgr.getDecomposedExpansionLoc(From);
----------------
Note that `getDecomposedExpansionLoc ` can also return invalid decomposed loc.


================
Comment at: lib/Index/IndexingAction.cpp:308
+ if (!FE)
+ return;
+ auto lineNo = SourceMgr.getLineNumber(LocInfo.first, LocInfo.second);
----------------
Do we want better error handling here?


================
Comment at: lib/Index/IndexingAction.cpp:327
+
+class IndexDependencyProvider {
+public:
----------------
Please provide documentation.


================
Comment at: lib/Index/IndexingAction.cpp:504
+
+ CreatedASTConsumer = true;
+ std::vector<std::unique_ptr<ASTConsumer>> Consumers;
----------------
Can we get this state from the base class instead of maintaining a another state, which seems to be identical?


================
Comment at: lib/Index/IndexingAction.cpp:524
+ std::string RepositoryPath = getClangRepositoryPath();
+ StringRef BuildNumber = StringRef(RepositoryPath);
+ size_t DashOffset = BuildNumber.find('-');
----------------
Just `StringRef BuildNumber = RepositoryPath;`



================
Comment at: lib/Index/IndexingAction.cpp:701
+namespace {
+class ModuleFileIndexDependencyCollector : public IndexDependencyProvider {
+ serialization::ModuleFile &ModFile;
----------------
Please provide a brief documentation for this class.


================
Comment at: lib/Index/IndexingAction.cpp:703
+ serialization::ModuleFile &ModFile;
+ RecordingOptions RecordOpts;
+
----------------
Again, it doesn't seem necessary for this class to have information about all record options. It seems that you only need `RecordSystemDependencies` here.


================
Comment at: lib/Index/IndexingAction.cpp:713
+ override {
+ auto Reader = CI.getModuleManager();
+ Reader->visitInputFiles(ModFile, RecordOpts.RecordSystemDependencies,
----------------
readability nit: avoid using `auto` if the return type is short to spell but hard to infer from the value expression. Same else where.


================
Comment at: lib/Index/IndexingAction.cpp:762
+ HeaderSearch &HS = CI.getPreprocessor().getHeaderSearchInfo();
+ Module *UnitMod = HS.lookupModule(Mod.ModuleName, /*AllowSearch=*/false);
+
----------------
Could you add a comment explaining why we are not allowing searching.


================
Comment at: lib/Index/IndexingAction.cpp:769
+ IndexCtx.setSysrootPath(SysrootPath);
+ Recorder.init(&IndexCtx, CI);
+
----------------
It's a bit worrying that `IndexDataRecorder` and `IndexContext` reference each other. If you only need some information from the `IndexingContext`, simply pass it into `Recorder`. In this case, I think you only need the `SourceManager` from the `ASTContext` in the recorder to calculate whether a file is a system header. I see you also cache result of `IndexingContext::isSystemFile` in the indexing context, but I think it would be more sensible for the callers to handle caching for this call.


================
Comment at: lib/Index/IndexingAction.cpp:771
+
+ for (const Decl *D : CI.getModuleManager()->getModuleFileLevelDecls(Mod)) {
+ IndexCtx.indexTopLevelDecl(D);
----------------
nit: no braces around one liners.


================
Comment at: lib/Index/IndexingAction.cpp:779
+ Mod.FileName, /*RootFile=*/nullptr, UnitMod, SysrootPath);
+
+}
----------------
nit: redundant empty line


================
Comment at: lib/Index/IndexingAction.cpp:837
+ index::RecordingOptions RecordOpts;
+ std::tie(IndexOpts, RecordOpts) = getIndexOptionsFromFrontendOptions(FEOpts);
+ return ::createIndexDataRecordingAction(IndexOpts, RecordOpts,
----------------
Just `auto pair = getIndexOptionsFromFrontendOptions(FEOpts);` and then use `pair.first` and `pair.second`?

Same below.


================
Comment at: lib/Index/IndexingContext.h:39

class IndexingContext {
IndexingOptions IndexOpts;
----------------
Please define the scope of this class to avoid throwing random states into it, which usually happens to a "context" class.


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2017-12-07 17:53:19 UTC
Permalink
malaperle added inline comments.


================
Comment at: include/indexstore/IndexStoreCXX.h:84
+
+ std::pair<unsigned, unsigned> getLineCol() {
+ unsigned line, col;
----------------
Post by Eric Liu via Phabricator via cfe-commits
Post by Nathan Hawes via Phabricator via cfe-commits
Post by Marc-Andre Laperle via Phabricator via cfe-commits
"MyClass o1, o2;"
If I use "Find Selected Symbol in Workspace" on MyClass constructor, if won't be able to highlight o1 and o2
Do you think it would be possible to add that (EndLineCol)? If not, how would one go about extending libindexstore in order to add additional information per occurrence? It is not obvious to me so far.
We also need other things, for example in the case of definitions, we need the full range of the definition so that users can "peek" at definitions in a pop-up. Without storing the end of the definition in the index, we would need to reparse the file.
Our approach to related locations (e.g. name, signature, body, and doc comment start/end locs) has been to not include them in the index and derive them from the start location later. There's less data to collect and write out during the build that way, and deriving the other locations isn't that costly usually, as in most cases you 1) don't need to type check or even preprocess to get the related locations, as with finding the end of o1 and o2 in your example, and 2) usually only need to derive locations for a single or small number of occurrences, like when 'peeking' at a definition.
Are there cases where you think this approach won't work/perform well enough for the indexer-backed queries clangd needs to support?
I agree that we should try to keep the serialized symbol size minimal, but I think it's worthwhile to also store end locations for occurrences because 1) it's cheap, 2) it's not necessary easy to compute without parsing for occurrences like `a::b::c` or `a::b<X>`, 3) it would be useful for many LSP use cases.
There's a few reason I think it's better to store the end loc.
- When doing "find references", computing the end loc of each occurrence will be costly. Imagine having thousands of occurrences and for each of them having to run logic to find the end of the occurrence.
- The AST and preprocessor are the best tools I know to figure out the proper end loc. Not using them means having to write a mini-preprocessor with some knowledge about the language semantics to cover some cases.
```
MyClass |o1, o2;
```
Here, I have to stop at the comma. So it's basically take any alpha-numerical character, right?
```
bool operator<(const Foo&, const Foo&)
Ret operator()(Params ...params)
```
No, in those cases, we have to take < and the first ().
- In the case of body start/end locations, similarly, it can be non-trivial.
```
void foo() {
if (0) {
}
}
```
We have to count the balanced { } until we finish the body.

```
#define FUNC_BODY {\
\
}
void foo() FUNC_BODY
```
Oops, where's the body? We need another special logic for this, etc.

I think overall, it puts a lot of burden on the client of libIndexStore, burden that would be much more work and more inaccurate than using the AST/Preprocessor while indexing.


================
Comment at: include/indexstore/IndexStoreCXX.h:374
+
+ enum class DependencyKind {
+ Unit,
----------------
Post by Eric Liu via Phabricator via cfe-commits
Post by Nathan Hawes via Phabricator via cfe-commits
As part of this dependency tracking mechanism, I haven't found that it could provide information about about the files including a specific header. So given a header (or source file in odd cases), I do not see a way to get all the files that would need to be reindexed if that header changed. Is that something you achieve outside the index? Or perhaps this is something I missed in the code.
The unit files store the path of the header/source files they depend on as 'File' dependencies. So any unit file with 'File' dependency on header/source file that was modified may need to be re-indexed.
To support finding which specific files include or are included by a given header (rather than which units somehow transitively include it) we also store the file-to-file inclusions in the unit file (retrieved via IndexUnitReader's foreachInclude method below).
Thanks! I'll play around with this a bit more with this new information.


================
Comment at: include/indexstore/IndexStoreCXX.h:377
+ Record,
+ File,
+ };
----------------
Post by Eric Liu via Phabricator via cfe-commits
Post by Nathan Hawes via Phabricator via cfe-commits
Could there be a bit of explanation about what's a File dependency versus record and unit? All units and records are file dependencies, right? Are there any files that are neither records or units?
I'll rename this to SourceFile and add some comments to explain. Unit file dependencies separate the source dependencies into 'File' dependencies and 'Record' dependencies. The 'File' dependencies track the paths of the header/source files seen in the translation unit, while the 'Record' dependencies track which record files have the symbolic content seen in those source files – the header/source file path doesn't appear anywhere in the record file. This separation lets us depend on a single record file corresponding to multiple source files (e.g. when two source files have the same symbolic content), and on a single source file corresponding multiple record files (e.g. when a single header is included multiple times with different preprocessor contexts changing its symbolic content).
It's more clear now, thanks!


https://reviews.llvm.org/D39050
Argyrios Kyrtzidis via Phabricator via cfe-commits
2017-12-07 18:50:10 UTC
Permalink
akyrtzi added a comment.

@malaperle, to clarify we are not suggesting that you write your own parser, the suggestion is to use clang in 'fast-scan' mode to get the structure of the declarations of a single file, see `CXTranslationUnit_SingleFileParse` (along with enabling skipping of bodies). We have found clang is super fast when you only try to get the structure of a file like this. We can make convenient APIs to provide the syntactic structure of declarations based on their location.

But let's say we added the end-loc, is it enough ? If you want to implement the 'peek the definition' like Eclipse, then it is not enough, you also need to figure out if there are documentation comments associated with the declaration and also show those. Also what if you want to highlight the type signature of a function, then just storing the location of the closing brace of its body is not enough. There can be any arbitrary things you may want to get from the structure of the declaration (e.g. the parameter ranges), but we could provide an API to gather any syntactic structure info you may want.

I would encourage you to try `CXTranslationUnit_SingleFileParse|CXTranslationUnit_SkipFunctionBodies`, you will be pleasantly surprised with how fast this mode is. The `c-index-test` option is `-single-file-parse`.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-12-07 23:42:22 UTC
Permalink
nathawes added inline comments.


================
Comment at: lib/FrontendTool/ExecuteCompilerInvocation.cpp:170
+ Act = index::createIndexDataRecordingAction(FEOpts, std::move(Act));
+ CI.setGenModuleActionWrapper(&index::createIndexDataRecordingAction);
+ }
----------------
Post by Eric Liu via Phabricator via cfe-commits
Could you comment on what this does? The `Act` above is already wrapped. Why do we need `setGenModuleActionWrapper` to `createIndexDataRecordingAction` again? Also, `createIndexDataRecordingAction` doesn't seem related to `GenModule`.
It's to wrap any GenerateModuleActions that get created as needed when/if Act ends up loading any modules, so that we output index data for them too. I'll add a comment.


================
Comment at: lib/Index/FileIndexRecord.cpp:39
+ auto It = std::upper_bound(Decls.begin(), Decls.end(), NewInfo);
+ Decls.insert(It, std::move(NewInfo));
+}
----------------
Post by Eric Liu via Phabricator via cfe-commits
Why do we need `Decls` to be sorted by offset? If we want this for printing, it might make sense to just do a sort there.
It's mostly for when we hash them, so that ordering doesn't change the hash, but it's also for printing. The IndexASTConsumer doesn't always report symbol occurrences in source order, due to the preprocessor and a few other cases.
We can sort them when the IndexRecordDataConsumer's finish() is called rather than as they're added to avoid the copying from repeated insert calls if that's the concern.


================
Comment at: lib/Index/IndexingAction.cpp:504
+
+ CreatedASTConsumer = true;
+ std::vector<std::unique_ptr<ASTConsumer>> Consumers;
----------------
Post by Eric Liu via Phabricator via cfe-commits
Can we get this state from the base class instead of maintaining a another state, which seems to be identical?
I don't see this state in either base class (WrapperFrontendAction and IndexRecordActionBase). WrappingIndexAction and WrappingIndexRecordAction both have this, though. Were you thinking a new intermediate common base class between them and WrapperFrontendAction?


================
Comment at: lib/Index/IndexingAction.cpp:769
+ IndexCtx.setSysrootPath(SysrootPath);
+ Recorder.init(&IndexCtx, CI);
+
----------------
Post by Eric Liu via Phabricator via cfe-commits
It's a bit worrying that `IndexDataRecorder` and `IndexContext` reference each other. If you only need some information from the `IndexingContext`, simply pass it into `Recorder`. In this case, I think you only need the `SourceManager` from the `ASTContext` in the recorder to calculate whether a file is a system header. I see you also cache result of `IndexingContext::isSystemFile` in the indexing context, but I think it would be more sensible for the callers to handle caching for this call.
Good point. The IndexingContext was actually already calling IsSystemFile before it calls IndexDataRecorder's handleDeclOccurrence and handleModuleOccurrence anyway, so I'll change it to pass that through as an extra param and remove IndexDataRecorder's dependency on the IndexingContext.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-12-08 00:02:09 UTC
Permalink
nathawes updated this revision to Diff 126065.
nathawes added a comment.
Herald added a subscriber: mgrang.

Worked through the comments from @ioeric and split the code for writing out the collected indexing data into a separate patch.


https://reviews.llvm.org/D39050

Files:
include/clang/Basic/AllDiagnostics.h
include/clang/Basic/CMakeLists.txt
include/clang/Basic/Diagnostic.td
include/clang/Basic/DiagnosticGroups.td
include/clang/Basic/DiagnosticIDs.h
include/clang/Basic/DiagnosticIndexKinds.td
include/clang/Driver/Job.h
include/clang/Driver/Options.td
include/clang/Frontend/CompilerInstance.h
include/clang/Frontend/FrontendOptions.h
include/clang/Index/IndexDataConsumer.h
include/clang/Index/IndexDiagnostic.h
include/clang/Index/IndexingAction.h
include/clang/module.modulemap
lib/Basic/DiagnosticIDs.cpp
lib/Driver/Driver.cpp
lib/Driver/Job.cpp
lib/Driver/ToolChains/Clang.cpp
lib/Driver/ToolChains/Darwin.cpp
lib/Frontend/CompilerInstance.cpp
lib/Frontend/CompilerInvocation.cpp
lib/FrontendTool/CMakeLists.txt
lib/FrontendTool/ExecuteCompilerInvocation.cpp
lib/Index/CMakeLists.txt
lib/Index/FileIndexRecord.cpp
lib/Index/FileIndexRecord.h
lib/Index/IndexingAction.cpp
lib/Index/IndexingContext.cpp
lib/Index/IndexingContext.h
test/Index/Store/assembly-invocation.c
tools/c-index-test/core_main.cpp
tools/diagtool/DiagnosticNames.cpp
tools/libclang/CXIndexDataConsumer.cpp
tools/libclang/CXIndexDataConsumer.h
Marc-Andre Laperle via Phabricator via cfe-commits
2017-12-08 05:35:14 UTC
Permalink
malaperle added a comment.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
@malaperle, to clarify we are not suggesting that you write your own parser, the suggestion is to use clang in 'fast-scan' mode to get the structure of the declarations of a single file, see `CXTranslationUnit_SingleFileParse` (along with enabling skipping of bodies). We have found clang is super fast when you only try to get the structure of a file like this.
Thank you, that sounds very useful. I will try that and get some measurements.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
We can make convenient APIs to provide the syntactic structure of declarations based on their location.
Perhaps just for the end-loc since it's pretty much guaranteed to be needed by everyone. But if it's very straightforward, perhaps that's not needed. I'll try and see.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
But let's say we added the end-loc, is it enough ? If you want to implement the 'peek the definition' like Eclipse, then it is not enough, you also need to figure out if there are documentation comments associated with the declaration and also show those. Also what if you want to highlight the type signature of a function, then just storing the location of the closing brace of its body is not enough. There can be any arbitrary things you may want to get from the structure of the declaration (e.g. the parameter ranges), but we could provide an API to gather any syntactic structure info you may want.
That's a very good point. I guess in the back of my mind, I have the worry that one cannot extend what is stored, either for a different performance trade-off or for additional things. The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend. But perhaps it's better to not overthink this for now.


https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2017-12-11 22:21:22 UTC
Permalink
ioeric added a comment.

Thanks a lot for the changes! Some more comments inlined.

Please mark addressed comments as done so that reviewers could know what to look :) Thanks!



================
Comment at: include/clang/Frontend/CompilerInstance.h:187
+ typedef std::function<std::unique_ptr<FrontendAction>(
+ const FrontendOptions &opts, std::unique_ptr<FrontendAction> action)>
+ ActionWrapperTy;
----------------
nit: LLVM variable names start with upper-case letters.


================
Comment at: include/clang/Index/IndexingAction.h:34
class IndexDataConsumer;
+ class IndexUnitWriter;

----------------
This should be removed?

Some forward declarations above are not used as well.


================
Comment at: lib/Driver/Job.cpp:293
+ if (HaveCrashVFS) {
+ IndexStoreDir = llvm::sys::path::parent_path(
+ llvm::sys::path::parent_path(CrashInfo->VFSPath));
----------------
Could you share this code with line 278 above, which already has a nice comment?


================
Comment at: lib/Index/FileIndexRecord.cpp:39
+ auto It = std::upper_bound(Decls.begin(), Decls.end(), NewInfo);
+ Decls.insert(It, std::move(NewInfo));
+}
----------------
Post by Nathan Hawes via Phabricator via cfe-commits
Post by Eric Liu via Phabricator via cfe-commits
Why do we need `Decls` to be sorted by offset? If we want this for printing, it might make sense to just do a sort there.
It's mostly for when we hash them, so that ordering doesn't change the hash, but it's also for printing. The IndexASTConsumer doesn't always report symbol occurrences in source order, due to the preprocessor and a few other cases.
We can sort them when the IndexRecordDataConsumer's finish() is called rather than as they're added to avoid the copying from repeated insert calls if that's the concern.
I would leave the sorting to the point where records are hashed to avoid making the record stateful. Consider changing `getDeclOccurrences` to `getOccurrencesSortedByOffset`; this should make the behavior more explicit.


================
Comment at: lib/Index/FileIndexRecord.h:51
+public:
+ FileIndexRecord(FileID FID, bool isSystem) : FID(FID), IsSystem(isSystem) {}
+
----------------
s/isSystem/IsSystem/

Also, I wonder if we can filter out system decls proactively and avoid creating file index record for them. We could also avoid propogating `IsSystem` here.


================
Comment at: lib/Index/IndexingAction.cpp:504
+
+ CreatedASTConsumer = true;
+ std::vector<std::unique_ptr<ASTConsumer>> Consumers;
----------------
Post by Nathan Hawes via Phabricator via cfe-commits
Post by Eric Liu via Phabricator via cfe-commits
Can we get this state from the base class instead of maintaining a another state, which seems to be identical?
I don't see this state in either base class (WrapperFrontendAction and IndexRecordActionBase). WrappingIndexAction and WrappingIndexRecordAction both have this, though. Were you thinking a new intermediate common base class between them and WrapperFrontendAction?
I thought this could be a state in the `WrapperFrontendAction` since both derived classes maintain this state, but after a closer look, this seems to depend on both base classes. I'm not a big fun of maintaining states in multi-stage classes (e.g. `FrontendAction`), which could be confusing and hard to follow; I think `IndexRecordActionBase::finish(...)` should be able to handle the case where no index consumer is created (i.e. no record/dependency/... is collected).

Also, `IndexRecordActionBase` (and the existing `IndexActionBase `) should really be a component instead of a base class since none of its methods is `virtual`.


================
Comment at: lib/Index/IndexingAction.cpp:370
+public:
+ SourceFilesIndexDependencyCollector(IsSystemFileCache &SysrootPath,
+ RecordingOptions recordOpts)
----------------
`IsSystemFileCache &SysrootPath`? What is this parameter?


================
Comment at: lib/Index/IndexingAction.cpp:459
+
+class IndexRecordActionBase {
+protected:
----------------
Please document this class. This can be easily confused with `IndexActionBase` which has a similar name. Same for `IndexAction`/`IndexRecordAction` and `WrappingIndexRecordAction`/`WrappingIndexRecordAction`. I think these pairs share (especially the wrapping actions) some common logics and could probably be merged.


================
Comment at: lib/Index/IndexingAction.cpp:485
+
+ void finish(CompilerInstance &CI);
+};
----------------
This does a lot of stuff... please document the behavior!


================
Comment at: lib/Index/IndexingAction.cpp:577
+ if (!isModuleGeneration &&
+ CI.getFrontendOpts().ProgramAction != frontend::GeneratePCH) {
+ RootFile = SM.getFileEntryForID(SM.getMainFileID());
----------------
nit: no need for braces. Same below.


================
Comment at: lib/Index/IndexingAction.cpp:589
+
+static void writeUnitData(const CompilerInstance &CI,
+ IndexDataRecorder &Recorder,
----------------
In the previous patch, `writeUnitData` does several things including handling modules, dependencies, includes and index records, as well as writing data. It might make sense to add an abstract class (`UnitDataCollector`?) that defines interfaces which make these behavior more explicit. We can then have users pass in an implementation via `createIndexDataRecordingAction` which would also decouple the data collection from data storage in the library.


================
Comment at: lib/Index/IndexingAction.cpp:624
+
+std::unique_ptr<FrontendAction> index::createIndexDataRecordingAction(
+ const FrontendOptions &FEOpts,
----------------
I'm a bit nervous about propagating the entire `FrontendOptions` into the index library. I would simply expose `getIndexOptionsFromFrontendOptions` and have callers parse `FrontendOptions` and pass in only index-related options.


================
Comment at: lib/Index/IndexingContext.h:41
+/// file is considered a system file or not
+class IsSystemFileCache {
+ std::string SysrootPath;
----------------
This name is really confusing... `Is*` is usually used for booleans. Simply call this `SystemFileCache`.


================
Comment at: lib/Index/IndexingContext.h:53
+
+ void setSysrootPath(StringRef path);
+ StringRef getSysrootPath() const { return SysrootPath; }
----------------
How does this affect the existing cached results? Do you need to invalidate them?


================
Comment at: lib/Index/IndexingContext.h:63
IndexDataConsumer &DataConsumer;
+ IsSystemFileCache &IsSystemCache;
ASTContext *Ctx = nullptr;
----------------
I think it would be more straightforward to have context own the cache. If `setSysrootPath` is the problem, it might make sense to propagate it via the context or, if necessary, create a new cache when a new `SysrootPath` is set.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-12-12 19:51:57 UTC
Permalink
nathawes planned changes to this revision.
nathawes added a comment.

Thanks for taking another look @ioeric – I'll work through your comments and update.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-12-18 22:05:36 UTC
Permalink
nathawes marked 45 inline comments as done.
nathawes added inline comments.


================
Comment at: lib/Index/FileIndexRecord.h:51
+public:
+ FileIndexRecord(FileID FID, bool isSystem) : FID(FID), IsSystem(isSystem) {}
+
----------------
Post by Eric Liu via Phabricator via cfe-commits
s/isSystem/IsSystem/
Also, I wonder if we can filter out system decls proactively and avoid creating file index record for them. We could also avoid propogating `IsSystem` here.
If the -index-ignore-system-symbols flag is set system decls are filtered out in IndexingContext::handleDeclOccurrence and aren't reported to the IndexDataConsumer, so FileIndexRecords won't be created. The IsSystem here is for clients that want index data for system files, but want to be able to distinguish them from regular files.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2017-12-18 22:58:27 UTC
Permalink
nathawes updated this revision to Diff 127412.
nathawes marked an inline comment as done.
nathawes added a comment.

I've refactored the indexing/dependency data collection out from the writing with the new IndexUnitDataConsumer class, and made other smaller changes to address the feedback from @ioeric.


https://reviews.llvm.org/D39050

Files:
include/clang/Basic/AllDiagnostics.h
include/clang/Basic/CMakeLists.txt
include/clang/Basic/Diagnostic.td
include/clang/Basic/DiagnosticGroups.td
include/clang/Basic/DiagnosticIDs.h
include/clang/Basic/DiagnosticIndexKinds.td
include/clang/Driver/Job.h
include/clang/Driver/Options.td
include/clang/Frontend/CompilerInstance.h
include/clang/Frontend/FrontendOptions.h
include/clang/Index/DeclOccurrence.h
include/clang/Index/IndexDataConsumer.h
include/clang/Index/IndexDiagnostic.h
include/clang/Index/IndexUnitDataConsumer.h
include/clang/Index/IndexingAction.h
include/clang/module.modulemap
lib/Basic/DiagnosticIDs.cpp
lib/Driver/Driver.cpp
lib/Driver/Job.cpp
lib/Driver/ToolChains/Clang.cpp
lib/Driver/ToolChains/Darwin.cpp
lib/Frontend/CompilerInstance.cpp
lib/Frontend/CompilerInvocation.cpp
lib/FrontendTool/CMakeLists.txt
lib/FrontendTool/ExecuteCompilerInvocation.cpp
lib/Index/CMakeLists.txt
lib/Index/FileIndexData.cpp
lib/Index/FileIndexData.h
lib/Index/IndexingAction.cpp
lib/Index/IndexingContext.cpp
lib/Index/IndexingContext.h
test/Index/Store/assembly-invocation.c
tools/c-index-test/core_main.cpp
tools/diagtool/DiagnosticNames.cpp
tools/libclang/CXIndexDataConsumer.cpp
tools/libclang/CXIndexDataConsumer.h
Nathan Hawes via Phabricator via cfe-commits
2017-12-19 18:50:41 UTC
Permalink
nathawes updated this revision to Diff 127568.
nathawes added a comment.

Fix out of data header comment in FileIndexData.h


https://reviews.llvm.org/D39050

Files:
include/clang/Basic/AllDiagnostics.h
include/clang/Basic/CMakeLists.txt
include/clang/Basic/Diagnostic.td
include/clang/Basic/DiagnosticGroups.td
include/clang/Basic/DiagnosticIDs.h
include/clang/Basic/DiagnosticIndexKinds.td
include/clang/Driver/Job.h
include/clang/Driver/Options.td
include/clang/Frontend/CompilerInstance.h
include/clang/Frontend/FrontendOptions.h
include/clang/Index/DeclOccurrence.h
include/clang/Index/IndexDataConsumer.h
include/clang/Index/IndexDiagnostic.h
include/clang/Index/IndexUnitDataConsumer.h
include/clang/Index/IndexingAction.h
include/clang/module.modulemap
lib/Basic/DiagnosticIDs.cpp
lib/Driver/Driver.cpp
lib/Driver/Job.cpp
lib/Driver/ToolChains/Clang.cpp
lib/Driver/ToolChains/Darwin.cpp
lib/Frontend/CompilerInstance.cpp
lib/Frontend/CompilerInvocation.cpp
lib/FrontendTool/CMakeLists.txt
lib/FrontendTool/ExecuteCompilerInvocation.cpp
lib/Index/CMakeLists.txt
lib/Index/FileIndexData.cpp
lib/Index/FileIndexData.h
lib/Index/IndexingAction.cpp
lib/Index/IndexingContext.cpp
lib/Index/IndexingContext.h
test/Index/Store/assembly-invocation.c
tools/c-index-test/core_main.cpp
tools/diagtool/DiagnosticNames.cpp
tools/libclang/CXIndexDataConsumer.cpp
tools/libclang/CXIndexDataConsumer.h
Eric Liu via Phabricator via cfe-commits
2017-12-19 23:01:28 UTC
Permalink
ioeric added a comment.

Thanks a lot for further cleaning up the patch! It is now much easier to review. I really appreciate it!

Some more comments on the public APIs and the layering of classes. There are a lot of helper classes in the implementation, so I think it's important to get a clear layering so that they could be easily understood by future contributors :)

Also, with the `IndexUnitDataConsumer` abstraction, it seems to be feasible now to add some unit tests for `createUnitIndexingAction`. With all the recent major changes, I think it's important that we have some degree of testing to make sure components actually work together.



================
Comment at: include/clang/Frontend/CompilerInstance.h:187
+ typedef std::function<std::unique_ptr<FrontendAction>(
+ const FrontendOptions &opts, std::unique_ptr<FrontendAction> action)>
+ ActionWrapperTy;
----------------
Post by Eric Liu via Phabricator via cfe-commits
nit: LLVM variable names start with upper-case letters.
`opts` and `action` are still lower-case.


================
Comment at: include/clang/Index/DeclOccurrence.h:38
+ }
+ };
+
----------------
Nit: indentation.

Tip: `git-clang-format` against the diff base can format all changed lines in your patch.


================
Comment at: include/clang/Index/IndexUnitDataConsumer.h:1
+//===--- IndexUnitDataConsumer.h - Abstract index unit data consumer ---------------===//
+//
----------------
IIUC, this is the index data for a translation unit, as opposed to an AST. If so, consider calling this `UnitIndexDataConsumer` to match `(AST)IndexDataConsumer` which is parallel to this. We might want to rename them to be either `index::UnitDataConsumer` vs `index::ASTDataConsumer` or `index::UnitIndexDataConsumer` vs `index::ASTIndexDataConsumer` . I am inclined to the first pair as `index` is already implied in the namespace.


================
Comment at: include/clang/Index/IndexUnitDataConsumer.h:67
+private:
+ virtual void _anchor();
+};
----------------
Comment? Why do we actually need this?


================
Comment at: include/clang/Index/IndexingAction.h:49

+struct UnitIndexingOptions : IndexingOptions {
+ enum class FileIncludeFilterKind {
----------------
We are now mixing functionalities for Unit indexing and AST indexing actions in the same file. We might want to separate these into two headers e..g `UnitIndexingAction.h` and `ASTIndexingAction.h`. This would make it easier for users to find the right functions :)


================
Comment at: include/clang/Index/IndexingAction.h:66
+/// Information on the translation unit
+struct UnitDetails {
+ Module *UnitModule;
----------------
Please add documentation for each field. It's not trivial what each field is for, especially some fields seem to be optional and some seem to be mutually exclusive.


================
Comment at: include/clang/Index/IndexingAction.h:67
+struct UnitDetails {
+ Module *UnitModule;
+ std::string ModuleName;
----------------
These pointers suggest the life time of this struct is tied to some other struct, which makes the struct look a bit dangerous to use. Should we also carry a reference or a smart pointer to the underlying object that keeps these pointers valid? Would it be a `CompilerInstance ` (guessing from `IndexUnitDataConsumerFactory `)?


================
Comment at: include/clang/Index/IndexingAction.h:114
+std::unique_ptr<FrontendAction>
+createUnitIndexingAction(const UnitIndexingOptions &IndexOpts,
+ IndexUnitDataConsumerFactory ConsumerFactory,
----------------
What is the intended user of this function? It's unclear how users could obtain a `ConsumerFactory ` (i.e. `UnitDetails`) without the functionalities in `UnitDataConsumerActionImpl `.

(Also see comment in the implementation of `createIndexDataRecordingAction`.)


================
Comment at: include/clang/Index/IndexingAction.h:128
+RecordingOptions
+getRecordingOptionsFromFrontendOptions(const FrontendOptions &FEOpts);
+
----------------
This is likely only useful for compiler invocation. I would put it in the compiler invocation code.


================
Comment at: lib/Driver/Job.cpp:211

+/// The leftover modules from the crash are stored in
+/// <name>.cache/vfs/modules
----------------
nit: Comment should start with an overview of what the function does.

```
Returns a directory path that is ...
```

Also, consider calling this `getDirAdjacentToModCache`. `buildDir` can be ambiguous.


================
Comment at: lib/Driver/Job.cpp:220
+ llvm::SmallString<128> RelModCacheDir = llvm::sys::path::parent_path(
+ llvm::sys::path::parent_path(CrashInfo->VFSPath));
+ llvm::sys::path::append(RelModCacheDir, DirName);
----------------
Please clang-format the code. Without indentation, this looks like an no-op statement.


================
Comment at: lib/Index/IndexingAction.cpp:88
+/// \c WrappingIndexAction frontend actions.
+struct IndexActionImpl {
+ virtual ~IndexActionImpl() = default;
----------------
Use `class` for interfaces.


================
Comment at: lib/Index/IndexingAction.cpp:99
+ /// Callback at the end of processing a single input.
+ virtual void finish(CompilerInstance &CI) = 0;
};
----------------
Does `CI` here have to be the same instance as the one in `createIndexASTConsumer `? Might worth documenting.


================
Comment at: lib/Index/IndexingAction.cpp:137
+
+ CreatedASTConsumer = true;
+ std::vector<std::unique_ptr<ASTConsumer>> Consumers;
----------------
nit: Move this after `Impl->createIndexASTConsumer(CI)`.

Do we need to reset this flag? Calling `CreateASTConsumer ` multiple times on the same instance seems to be allowed?


================
Comment at: lib/Index/IndexingAction.cpp:243
+/// Collects and groups consumed index data by \c FileID.
+class IndexDataCollector : public IndexDataConsumer {
+ Preprocessor *PP = nullptr;
----------------
This seems to be related to files. Maybe `FileIndexDataCollector`?


================
Comment at: lib/Index/IndexingAction.cpp:250
+public:
+ void setPreprocessor(Preprocessor &PreProc) {
+ PP = &PreProc;
----------------
`override`


================
Comment at: lib/Index/IndexingAction.cpp:254
+
+ IndexDataByFileTy::const_iterator by_file_begin() const {
+ return IndexDataByFile.begin();
----------------
Simply `begin`, if the class is called `FileIndexDataCollector `.

Similar below to match iterator naming convention.


================
Comment at: lib/Index/IndexingAction.cpp:265
+private:
+ bool handleDeclOccurence(const Decl *D, SymbolRoleSet Roles,
+ ArrayRef<SymbolRelation> Relations, FileID FID,
----------------
I think this should be `public` as this is still implementing `IndexDataConsumer`.


================
Comment at: lib/Index/IndexingAction.cpp:323
+ llvm_unreachable("should have already checked in the beginning");
+ case UnitIndexingOptions::FileIncludeFilterKind::UserOnly:
+ if (SystemCache.isSystem(LocInfo.first, SourceMgr))
----------------
I'd simply do:
```
if FileIncludeFilter == UnitIndexingOptions::FileIncludeFilterKind::UserOnly)
if (isSystem...)
return;
```


================
Comment at: lib/Index/IndexingAction.cpp:337
+
+ virtual void InclusionDirective(SourceLocation HashLoc,
+ const Token &IncludeTok, StringRef FileName,
----------------
Same here. This should be `public`


================
Comment at: lib/Index/IndexingAction.cpp:354
+
+ virtual void visitFileDependencies(
+ const CompilerInstance &CI,
----------------
The naming convention for the callback interfaces is `forEach*` e.g. `forEachFileDependency`.

s/visitor/Callback/
(same below).


================
Comment at: lib/Index/IndexingAction.cpp:358
+ virtual void
+ visitIncludes(llvm::function_ref<void(const FileEntry *Source, unsigned Line,
+ const FileEntry *Target)>
----------------
`forEachInclude`


================
Comment at: lib/Index/IndexingAction.cpp:361
+ visitor) const = 0;
+ virtual void visitModuleImports(
+ const CompilerInstance &CI,
----------------
`forEachModuleImport`


================
Comment at: lib/Index/IndexingAction.cpp:369
+/// file to file inclusions, for the source files in a translation unit
+class SourceFilesIndexDependencyCollector : public DependencyCollector,
+ public IndexDependencyProvider {
----------------
This is two classes in one, which is difficult to understand. Could you split it into `FileIndexDependencyCollector ` and `FileIndexDependencyProvider` and have `FileIndexDependencyCollector` returns a provider on finish (e.g. `Provider consume();`; you might want to copy/move the collected data into the provider). It would be easier to justify the behavior (e.g. what happens when you access the provider while collector is still working?)


================
Comment at: lib/Index/IndexingAction.cpp:373
+ UnitIndexingOptions IndexOpts;
+ llvm::SetVector<const FileEntry *> Entries;
+ llvm::BitVector IsSystemByUID;
----------------
What does `Entries` contain? What files are added?


================
Comment at: lib/Index/IndexingAction.cpp:501
+ /// IndexUnitDataConsumer constructed from the \c UnitConsumerFactory if the
+ /// \c ParentUnitConsumer indicates \c Mod should be indexed.
+ ///
----------------
Instead of passing `ParentUnitConsumer`, consider checking the `Mod` before calling the function.


================
Comment at: lib/Index/IndexingAction.cpp:504
+ /// \returns true if \c Mod was indexed
+ static bool indexModule(
+ const CompilerInstance &CI, serialization::ModuleFile &Mod,
----------------
Non-factory static method is often a code smell. Any reason not to make these static methods private members? With that, you wouldn't need to pass along so many parameters. You could make them `const` if you don't want members to be modified.


================
Comment at: lib/Index/IndexingAction.cpp:511
+ /// Get unit details for the given module file
+ static UnitDetails getUnitDetails(serialization::ModuleFile &Mod,
+ const CompilerInstance &CI,
----------------
Why is this overload public while others are private? Aren't they all used only in this class?


================
Comment at: lib/Index/IndexingAction.cpp:531
+};
+} // anonymous namespace
+
----------------
Any reason to close the anonymous namespace here? Shouldn't outlined definitions of `UnitDataConsumerActionImpl`'s methods also in the anonymous namespace?


================
Comment at: lib/Index/IndexingAction.cpp:758
+
+ class IndexUnitDataRecorder : public IndexUnitDataConsumer {
+ public:
----------------
I think the inheritance of `IndexUnitDataConsumer` and the creation of factory should be in user code (e.g. implementation for on-disk persist-index-data should come from the compiler invocation code `ExecuteCompilerInvocation.cpp` or at least a separate file in the library that compiler invocation can use), and the user should only use `createUnitIndexingAction` by providing a factory. Currently, `createUnitIndexingAction` and `createIndexDataRecordingAction` are mostly identical except for the code that implements `IndexUnitDataConsumer ` and creates the factory.

The current `createIndexDataRecordingAction` would probably only used by the compiler invocation, and we can keep the generalized `createUnitIndexingAction` in the public APIs.


================
Comment at: lib/Index/IndexingAction.cpp:765
+ auto ConsumerFactory =
+ [&](const CompilerInstance &CI, UnitDetails UnitInfo) ->
+ std::unique_ptr<IndexUnitDataConsumer> {
----------------
The `UnitInfo` is ignored? What do we actually need it for?


================
Comment at: lib/Index/IndexingAction.cpp:769
+ };
+ auto Base = llvm::make_unique<UnitDataConsumerActionImpl>(
+ std::move(RecordOpts), ConsumerFactory);
----------------
`Base` doesn't seem to be a very meaningful name here.


https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2017-12-19 23:08:23 UTC
Permalink
ioeric added inline comments.


================
Comment at: include/clang/Index/IndexUnitDataConsumer.h:1
+//===--- IndexUnitDataConsumer.h - Abstract index unit data consumer ---------------===//
+//
----------------
Post by Eric Liu via Phabricator via cfe-commits
IIUC, this is the index data for a translation unit, as opposed to an AST. If so, consider calling this `UnitIndexDataConsumer` to match `(AST)IndexDataConsumer` which is parallel to this. We might want to rename them to be either `index::UnitDataConsumer` vs `index::ASTDataConsumer` or `index::UnitIndexDataConsumer` vs `index::ASTIndexDataConsumer` . I am inclined to the first pair as `index` is already implied in the namespace.
Sorry, asking you to also rename `IndexDataConsumer` is probably too much and out of the scope of this patch. I'm fine with `UnitIndexDataConsumer` or `UnitDataConsumer` or something similar for now without touching `IndexDataConsumer` :)


https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2018-01-03 13:54:17 UTC
Permalink
ioeric requested changes to this revision.
ioeric added a comment.
This revision now requires changes to proceed.

(I think I forgot to update the patch status :)


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2018-01-18 22:36:41 UTC
Permalink
nathawes planned changes to this revision.
nathawes marked 32 inline comments as done.
nathawes added a comment.

@ioeric I should have an updated patch up shortly with your inline comments addressed + new tests. Thanks again for reviewing!



================
Comment at: include/clang/Index/IndexUnitDataConsumer.h:67
+private:
+ virtual void _anchor();
+};
----------------
Post by Eric Liu via Phabricator via cfe-commits
Comment? Why do we actually need this?
From [[ https://stackoverflow.com/questions/34913197/what-is-vtable-anchoring-and-how-does-it-work-in-a-shared-object | here ]], my understanding is that it's an optimization to avoid the vtable being included in multiple translation units. I'm not sure if that's actually a problem, I was just following IndexDataConsumer's lead. Added a comment.


================
Comment at: include/clang/Index/IndexingAction.h:114
+std::unique_ptr<FrontendAction>
+createUnitIndexingAction(const UnitIndexingOptions &IndexOpts,
+ IndexUnitDataConsumerFactory ConsumerFactory,
----------------
Post by Eric Liu via Phabricator via cfe-commits
What is the intended user of this function? It's unclear how users could obtain a `ConsumerFactory ` (i.e. `UnitDetails`) without the functionalities in `UnitDataConsumerActionImpl `.
(Also see comment in the implementation of `createIndexDataRecordingAction`.)
Sorry, I'm not sure what you mean here. Users shouldn't need to know anything about `UnitDataConsumerActionImpl`, they just need to provide a lambda/function reference that takes a `CompilerInstance&` and a `UnitDetails` and returns an `IndexUnitDataConsumer` (or `UnitIndexDataConsumer` once I rename it). This gets called once per translation unit to get a distinct data consumer for each unit, i.e. for the main translation unit as well as for each of its dependent modules that the main unit's data consumer says should be indexed via `shouldIndexModuleDependency(...)`.


================
Comment at: include/clang/Index/IndexingAction.h:128
+RecordingOptions
+getRecordingOptionsFromFrontendOptions(const FrontendOptions &FEOpts);
+
----------------
Post by Eric Liu via Phabricator via cfe-commits
This is likely only useful for compiler invocation. I would put it in the compiler invocation code.
There's another public `index::` API for writing out index data for individual clang module files in the follow up patch that takes a `RecordingOptions` and is used externally, from Swift. This function's useful on the Swift side to get the `RecordingOptions` from `FrontendOptions` it has already set up.


================
Comment at: lib/Index/IndexingAction.cpp:137
+
+ CreatedASTConsumer = true;
+ std::vector<std::unique_ptr<ASTConsumer>> Consumers;
----------------
Post by Eric Liu via Phabricator via cfe-commits
nit: Move this after `Impl->createIndexASTConsumer(CI)`.
Do we need to reset this flag? Calling `CreateASTConsumer ` multiple times on the same instance seems to be allowed?
Oops. Yes, we do :-)


================
Comment at: lib/Index/IndexingAction.cpp:504
+ /// \returns true if \c Mod was indexed
+ static bool indexModule(
+ const CompilerInstance &CI, serialization::ModuleFile &Mod,
----------------
Post by Eric Liu via Phabricator via cfe-commits
Non-factory static method is often a code smell. Any reason not to make these static methods private members? With that, you wouldn't need to pass along so many parameters. You could make them `const` if you don't want members to be modified.
Sorry, there's missing context – they're used from another public API that's in the follow-up patch. I'll bring that over and make these top-level static functions, since they don't belong exclusively to IndexDataConsumerActionImpl.


================
Comment at: lib/Index/IndexingAction.cpp:511
+ /// Get unit details for the given module file
+ static UnitDetails getUnitDetails(serialization::ModuleFile &Mod,
+ const CompilerInstance &CI,
----------------
Post by Eric Liu via Phabricator via cfe-commits
Why is this overload public while others are private? Aren't they all used only in this class?
Same as above – this is called from a public `index::` API in the follow-up patch.


================
Comment at: lib/Index/IndexingAction.cpp:758
+
+ class IndexUnitDataRecorder : public IndexUnitDataConsumer {
+ public:
----------------
Post by Eric Liu via Phabricator via cfe-commits
I think the inheritance of `IndexUnitDataConsumer` and the creation of factory should be in user code (e.g. implementation for on-disk persist-index-data should come from the compiler invocation code `ExecuteCompilerInvocation.cpp` or at least a separate file in the library that compiler invocation can use), and the user should only use `createUnitIndexingAction` by providing a factory. Currently, `createUnitIndexingAction` and `createIndexDataRecordingAction` are mostly identical except for the code that implements `IndexUnitDataConsumer ` and creates the factory.
The current `createIndexDataRecordingAction` would probably only used by the compiler invocation, and we can keep the generalized `createUnitIndexingAction` in the public APIs.
`IndexUnitDataRecorder` here is just a stub I added when I split the patch up – the follow-up revision has it in a separate file. I'll move the separate files to this patch and stub out the method bodies with TODOs instead.

I've made `createIndexDataRecordingAction` call `createUnitIndexingAction` to remove the duplication, and pulled it, `RecordingOptions` and `getRecordingOptionsFromFrontendOptions` to a new header (`RecordingAction.h`) that `ExecuteComilerInvocation.cpp` uses. Does that sound ok?


================
Comment at: lib/Index/IndexingAction.cpp:765
+ auto ConsumerFactory =
+ [&](const CompilerInstance &CI, UnitDetails UnitInfo) ->
+ std::unique_ptr<IndexUnitDataConsumer> {
----------------
Post by Eric Liu via Phabricator via cfe-commits
The `UnitInfo` is ignored? What do we actually need it for?
It should be passed to IndexUnitDataRecorder to write out info about the unit itself. This was just me splitting the patch badly.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2018-01-18 23:03:12 UTC
Permalink
nathawes updated this revision to Diff 130496.
nathawes marked 6 inline comments as done.
nathawes added a comment.

- Applied the various refactorings suggested by @ioeric
- Extended c-index-test with a new option to print out the collected unit indexing data, and
- Added tests for the unit indexing functionality using the new option
- Fixed formatting


https://reviews.llvm.org/D39050

Files:
include/clang/Basic/AllDiagnostics.h
include/clang/Basic/CMakeLists.txt
include/clang/Basic/Diagnostic.td
include/clang/Basic/DiagnosticGroups.td
include/clang/Basic/DiagnosticIDs.h
include/clang/Basic/DiagnosticIndexKinds.td
include/clang/Driver/Job.h
include/clang/Driver/Options.td
include/clang/Frontend/CompilerInstance.h
include/clang/Frontend/FrontendOptions.h
include/clang/Index/DeclOccurrence.h
include/clang/Index/IndexDataConsumer.h
include/clang/Index/IndexDiagnostic.h
include/clang/Index/IndexingAction.h
include/clang/Index/RecordingAction.h
include/clang/Index/UnitIndexDataConsumer.h
include/clang/Index/UnitIndexingAction.h
include/clang/module.modulemap
lib/Basic/DiagnosticIDs.cpp
lib/Driver/Driver.cpp
lib/Driver/Job.cpp
lib/Driver/ToolChains/Clang.cpp
lib/Driver/ToolChains/Darwin.cpp
lib/Frontend/CompilerInstance.cpp
lib/Frontend/CompilerInvocation.cpp
lib/FrontendTool/CMakeLists.txt
lib/FrontendTool/ExecuteCompilerInvocation.cpp
lib/Index/CMakeLists.txt
lib/Index/FileIndexData.cpp
lib/Index/FileIndexData.h
lib/Index/IndexingAction.cpp
lib/Index/IndexingContext.cpp
lib/Index/IndexingContext.h
lib/Index/UnitIndexDataRecorder.cpp
lib/Index/UnitIndexDataRecorder.h
test/Index/Core/Inputs/module/ModDep.h
test/Index/Core/Inputs/module/ModSystem.h
test/Index/Core/Inputs/module/ModTop.h
test/Index/Core/Inputs/module/ModTopSub1.h
test/Index/Core/Inputs/module/ModTopSub2.h
test/Index/Core/Inputs/module/module.modulemap
test/Index/Core/Inputs/sys/system-head.h
test/Index/Core/Inputs/transitive-include.h
test/Index/Core/external-source-symbol-attr.m
test/Index/Core/index-instantiated-source.cpp
test/Index/Core/index-source.mm
test/Index/Core/index-subkinds.m
test/Index/Core/index-system.mm
test/Index/Core/index-unit.mm
test/Index/Store/assembly-invocation.c
tools/c-index-test/core_main.cpp
tools/diagtool/DiagnosticNames.cpp
tools/libclang/CXIndexDataConsumer.cpp
tools/libclang/CXIndexDataConsumer.h
Eric Liu via Phabricator via cfe-commits
2018-01-19 12:10:49 UTC
Permalink
ioeric accepted this revision.
ioeric added a comment.
This revision is now accepted and ready to land.

Nice! Thanks for making the refactoring and adding tests! I think this is good to go now.

I'm not very familiar with code outside of the index library (Driver, Basic etc), but the changes seem reasonable to me. Feel free to get another pair of eyes for them ;)



================
Comment at: include/clang/Index/RecordingAction.h:42
+std::unique_ptr<FrontendAction>
+createIndexDataRecordingAction(RecordingOptions RecordOpts,
+ std::unique_ptr<FrontendAction> WrappedAction);
----------------
Add a FIXME that this is not implemented yet.


================
Comment at: lib/Index/IndexingAction.cpp:758
+
+ class IndexUnitDataRecorder : public IndexUnitDataConsumer {
+ public:
----------------
Post by Nathan Hawes via Phabricator via cfe-commits
Post by Eric Liu via Phabricator via cfe-commits
I think the inheritance of `IndexUnitDataConsumer` and the creation of factory should be in user code (e.g. implementation for on-disk persist-index-data should come from the compiler invocation code `ExecuteCompilerInvocation.cpp` or at least a separate file in the library that compiler invocation can use), and the user should only use `createUnitIndexingAction` by providing a factory. Currently, `createUnitIndexingAction` and `createIndexDataRecordingAction` are mostly identical except for the code that implements `IndexUnitDataConsumer ` and creates the factory.
The current `createIndexDataRecordingAction` would probably only used by the compiler invocation, and we can keep the generalized `createUnitIndexingAction` in the public APIs.
`IndexUnitDataRecorder` here is just a stub I added when I split the patch up – the follow-up revision has it in a separate file. I'll move the separate files to this patch and stub out the method bodies with TODOs instead.
I've made `createIndexDataRecordingAction` call `createUnitIndexingAction` to remove the duplication, and pulled it, `RecordingOptions` and `getRecordingOptionsFromFrontendOptions` to a new header (`RecordingAction.h`) that `ExecuteComilerInvocation.cpp` uses. Does that sound ok?
Sounds good. Thanks for the explanation!


https://reviews.llvm.org/D39050
Eric Liu via Phabricator via cfe-commits
2018-02-12 13:52:21 UTC
Permalink
ioeric added a comment.

I'm wondering if there is any further plan for this? ;)


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2018-02-12 16:43:55 UTC
Permalink
malaperle added a comment.
Post by Eric Liu via Phabricator via cfe-commits
I'm wondering if there is any further plan for this? ;)
I'd like to comment on the amount of data that will be stored but that can be done outside this review. I still have a few things to figure out before reaching a conclusion.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2018-02-13 03:17:31 UTC
Permalink
nathawes added a comment.

@ioeric I'm working on a few other priorities over the next few weeks, sorry, but should get back to this relatively soon after that.
I would just land it, but I expect some downstream breakage I want to make sure I have time to fix.

@malaperle Sounds good – I'll keep an eye out for it!


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2018-02-27 21:38:07 UTC
Permalink
malaperle added a comment.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
@malaperle, to clarify we are not suggesting that you write your own parser, the suggestion is to use clang in 'fast-scan' mode to get the structure of the declarations of a single file, see `CXTranslationUnit_SingleFileParse` (along with enabling skipping of bodies). We have found clang is super fast when you only try to get the structure of a file like this.
Thank you, that sounds very useful. I will try that and get some measurements.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
We can make convenient APIs to provide the syntactic structure of declarations based on their location.
Perhaps just for the end-loc since it's pretty much guaranteed to be needed by everyone. But if it's very straightforward, perhaps that's not needed. I'll try and see.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
But let's say we added the end-loc, is it enough ? If you want to implement the 'peek the definition' like Eclipse, then it is not enough, you also need to figure out if there are documentation comments associated with the declaration and also show those. Also what if you want to highlight the type signature of a function, then just storing the location of the closing brace of its body is not enough. There can be any arbitrary things you may want to get from the structure of the declaration (e.g. the parameter ranges), but we could provide an API to gather any syntactic structure info you may want.
That's a very good point. I guess in the back of my mind, I have the worry that one cannot extend what is stored, either for a different performance trade-off or for additional things. The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend. But perhaps it's better to not overthink this for now.
I did a bit more of experimenting. For the end-loc, I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only. For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

With end-loc in index: 3.45s on average (20 samples)
With end-loc computed on the fly: 11.33s on average (20 samples)
I also tried with Xcode but without too much success: it took about 30 secs to reach 45K results and then carried on for a long time and hung (although I didn't try to leave it for hours to see if it finished).

From my perspective, it seems that the extra time is quite substantial and it doesn't seem worth to save an integer per occurrence in this case.

For computing the start/end-loc of function bodies, I tried the SingleFileParseMode and SkipFunctionBodies separately ( as a start). The source I use this on looks like this:

#include "MyClass.h"

MyClass::MyClass() {
}

void MyClass::doOperation() {
}
Post by Marc-Andre Laperle via Phabricator via cfe-commits
MyClass.cpp:5:1: error: use of undeclared identifier 'MyClass'
MyClass.cpp:8:6: error: use of undeclared identifier 'MyClass'
Then I cannot obtain any Decl* at the position of doOperation. With SingleFileParseMode, I'm also a bit weary that not processing headers will result in many inaccuracies. From our perspective, we are more wiling to sacrifice disk space in order to have more accuracy and speed. For comparison, the index I worked with containing all end-loc for occurrences and also function start/end is 201M for LLVM/Clang/Clangd which is small to us.

With SkipFunctionBodies alone, I can get the Decl* but FunctionDecl::getSourceRange() doesn't include the body, rather, it stops after the arguments.
It would be very nice if we could do this cheaply but it doesn't seem possible with those two flags alone. What did you have in mind for implementing an "API to gather any syntactic structure info" ?


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2018-03-13 18:21:28 UTC
Permalink
malaperle added a comment.
Given the discussion in https://reviews.llvm.org/D44247, I think we can do without the start/end-loc of function bodies and try some heuristics client-side. We can always revisit this later if necessary.

However, for the end-loc of occurrences, would you be OK with this being added? I think it would be a good compromise in terms of performance, simplicity and index size.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2018-03-13 20:43:00 UTC
Permalink
nathawes added a comment.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
Given the discussion in https://reviews.llvm.org/D44247, I think we can do without the start/end-loc of function bodies and try some heuristics client-side. We can always revisit this later if necessary.
However, for the end-loc of occurrences, would you be OK with this being added? I think it would be a good compromise in terms of performance, simplicity and index size.
@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.

@akyrtzi Are the numbers from Marc-Andre's experiment what you'd expect to see and is there anything else to try? I'm not familiar with those modes at all to comment, sorry. I assume any API to gather syntactic structure info would be based on those modes, right?


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2018-03-14 14:44:32 UTC
Permalink
malaperle added a comment.
Post by Nathan Hawes via Phabricator via cfe-commits
@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.
I mean the end of the name referencing the symbol, so that it can be highlighted properly when using the "find references in workspace" feature. There are cases where the name of the symbol itself is not present, for example "MyClass o1, o2;" (o1 and o2 reference the constructor), references to overloaded operators, etc.


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2018-03-14 17:06:03 UTC
Permalink
nathawes added a comment.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
Post by Nathan Hawes via Phabricator via cfe-commits
@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.
I mean the end of the name referencing the symbol, so that it can be highlighted properly when using the "find references in workspace" feature. There are cases where the name of the symbol itself is not present, for example "MyClass o1, o2;" (o1 and o2 reference the constructor), references to overloaded operators, etc.
Ah, I see – thanks! I was thinking all occurrences whose symbol name didn't actually appear at their location were marked with `SymbolRole::Implicit`, but that only seems to be true for the ObjC index data.


https://reviews.llvm.org/D39050
Argyrios Kyrtzidis via Phabricator via cfe-commits
2018-03-14 20:44:13 UTC
Permalink
akyrtzi added a comment.

Hey Marc,
Post by Marc-Andre Laperle via Phabricator via cfe-commits
The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend
I don't think "not possible to extend" is quite correct, we can make it so that the format allows optional data to be recorded.

On the topic of recording the end-loc, I agree it's not much data overall, but it will be useful to examine the uses closely and to figure out whether it's really required and whether it is at the same time inadequate for other uses.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only.
I assume you used SingleFileParseMode+SkipFunctionBodies for this, right ?
Post by Marc-Andre Laperle via Phabricator via cfe-commits
For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.
This is an interesting use case, and I can say we have some experience because Xcode has this functionality without requiring the end-loc for every reference.
So what it does is that it has a 'name' to look for (say 'foo' for the variable foo) and if it finds the name in the location then it highlights, otherwise if it doesn't find it (e.g. because it is an implicit reference) then it points to the location but doesn't highlight something. The same thing happens for operator overloads (the operators get highlighted at the reference location).
For implicit references it's most likely there's nothing to highlight so the end-loc will most likely be empty anyway (or same as start-loc ?) to indicate an empty range.
Good point, the parser definitely needs recovery improvements in C++.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
With SkipFunctionBodies alone, I can get the Decl* but FunctionDecl::getSourceRange() doesn't include the body
This seems strange, there's an EndRangeLoc field that should have been filled in, not exactly sure if it is a bug or omission.

Going back to the topic of what use cases end-loc covers, note that it still seems inadequate for peek-definition functionality. You can't set it to body-end loc (otherwise occurrences highlighting would highlight the whole body which I think is undesirable) and you still need to include doc-comments if they exist.


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2018-03-16 18:49:39 UTC
Permalink
malaperle added a comment.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Hey Marc,
Post by Marc-Andre Laperle via Phabricator via cfe-commits
The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend
I don't think "not possible to extend" is quite correct, we can make it so that the format allows optional data to be recorded.
That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this? I don't know much about those but perhaps that could be one way to extent the basic behavior of "-index_store_path" in this way?
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Post by Marc-Andre Laperle via Phabricator via cfe-commits
I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only.
I assume you used SingleFileParseMode+SkipFunctionBodies for this, right ?
No, sorry the end-locs I meant there is for occurrences. Only the lexer was needed to get the end of the token. So for "MyClass o1, o2;" o1 and o2 get highlighted as references to the MyClass constructor.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Post by Marc-Andre Laperle via Phabricator via cfe-commits
For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.
This is an interesting use case, and I can say we have some experience because Xcode has this functionality without requiring the end-loc for every reference.
So what it does is that it has a 'name' to look for (say 'foo' for the variable foo) and if it finds the name in the location then it highlights, otherwise if it doesn't find it (e.g. because it is an implicit reference) then it points to the location but doesn't highlight something.
I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
The same thing happens for operator overloads (the operators get highlighted at the reference location).
It does? I can only seem to do a textual search. For example, if I look at "FileId::operator<", if I right-click in the middle of "operator<" and do "Find selected symbol in workspace", it seems to start a text based search because there are many results that are semantically unrelated.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
For implicit references it's most likely there's nothing to highlight so the end-loc will most likely be empty anyway (or same as start-loc ?) to indicate an empty range.
I think for those cases the end of the token is probably suitable. Can you give examples which implicit references you have in mind? Maybe another one (other than the constructor mentioned above) could be a function call like "passMeAStdString(MyStringRef)", here the "operator std::string" would be called and MyStringRef could be highlighted, I think it would make sense to the user that is gets called by passing this parameter by seeing the highlight.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Going back to the topic of what use cases end-loc covers, note that it still seems inadequate for peek-definition functionality. You can't set it to body-end loc (otherwise occurrences highlighting would highlight the whole body which I think is undesirable) and you still need to include doc-comments if they exist.
I think maybe I wasn't clear, I was thinking about two end-locs: end-locs of occurrences and end-locs of bodies. The end-loc of occurrences would be used for highlight when searching for all occurrences and the end-loc for bodies would be used for the peek definition. I think we can disregard end-locs of bodies for now.


https://reviews.llvm.org/D39050
Argyrios Kyrtzidis via Phabricator via cfe-commits
2018-03-16 19:59:12 UTC
Permalink
akyrtzi added a comment.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this?
That's an interesting idea that we could explore, but I don't have much experience with that mechanism to comment on.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
Only the lexer was needed to get the end of the token
Ok, that's interesting, not sure why Xcode is so fast to highlight, did you reuse same SourceManager/Lexer/buffers for occurrences from same file ? We'd definitely add the end-loc if we cannot come up with a mechanism to highlight fast enough without it.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.
To clarify, even with implicit references the start loc points to something. In this case the implicit references can have start locs for the o1 and o2 identifiers and the end result for the UI will be the same (o1 and o2 get highlighted) even without having end-locs for all references.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
It does? I can only seem to do a textual search.
The example I tried is the following. If you could file a bug report for the test case that did not work as you expected it would be much appreciated!

class Something1 {
public:
Something1() {}
~Something1() {}
operator int() {
return 0;
}

friend int operator <<(Something1 &p, Something1 &p2) {
return 0;
}
};

void foo1(Something1 p1, Something1 p2) {
p1 << p2;
p1 << p2;
}
Post by Marc-Andre Laperle via Phabricator via cfe-commits
here the "operator std::string" would be called and MyStringRef could be highlighted
Even without end-loc, the start loc could point to MyStringRef and you could highlight it.


https://reviews.llvm.org/D39050
Marc-Andre Laperle via Phabricator via cfe-commits
2018-03-19 02:24:53 UTC
Permalink
malaperle added a comment.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Post by Marc-Andre Laperle via Phabricator via cfe-commits
That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this?
That's an interesting idea that we could explore, but I don't have much experience with that mechanism to comment on.
Post by Marc-Andre Laperle via Phabricator via cfe-commits
Only the lexer was needed to get the end of the token
Ok, that's interesting, not sure why Xcode is so fast to highlight, did you reuse same SourceManager/Lexer/buffers for occurrences from same file ? We'd definitely add the end-loc if we cannot come up with a mechanism to highlight fast enough without it.
I don't think Xcode is quite fast, it's about 10 times slower (although I'm not sure it really finished) than when I use my branch that has the end-loc. I would try end-locs in Xcode if I could, to compare :) So I don't really know where the bottleneck is in Xcode. Comparing oranges to oranges, it's 4 times slower without end-locs compared to with end-locs on my branch. I does use the same SourceManager for the 46K references and I verified that it uses the same buffers, etc.
I'll put the numbers here again for readability.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.
With end-loc in index: 3.45s on average (20 samples)
With end-loc computed on the fly: 11.33s on average (20 samples)
I also tried with Xcode but without too much success: it took about 30 secs to reach 45K results and then carried on for a long time and hung (although I didn't try to leave it for hours to see if it finished).
Post by Marc-Andre Laperle via Phabricator via cfe-commits
I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.
To clarify, even with implicit references the start loc points to something. In this case the implicit references can have start locs for the o1 and o2 identifiers and the end result for the UI will be the same (o1 and o2 get highlighted) even without having end-locs for all references.
It's the same but slower. IMO, the trade off is not great. It's entirely subjective but I think 4-10 times slower in order to save an integer per occurrence is not worth it from my point of view.
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Even without end-loc, the start loc could point to MyStringRef and you could highlight it.
(Same here, it's doable but faster if already in the index.)
Post by Argyrios Kyrtzidis via Phabricator via cfe-commits
Post by Marc-Andre Laperle via Phabricator via cfe-commits
It does? I can only seem to do a textual search.
The example I tried is the following. If you could file a bug report for the test case that did not work as you expected it would be much appreciated!
Sure, I'll give that a try and isolate it as much as I can. BTW, does it work for you on the LLVM code base?


https://reviews.llvm.org/D39050
Nathan Hawes via Phabricator via cfe-commits
2018-06-27 22:18:27 UTC
Permalink
nathawes updated this revision to Diff 153190.
nathawes added a comment.

Updated to apply on top-of-tree.


https://reviews.llvm.org/D39050

Files:
include/clang/Basic/AllDiagnostics.h
include/clang/Basic/CMakeLists.txt
include/clang/Basic/Diagnostic.td
include/clang/Basic/DiagnosticGroups.td
include/clang/Basic/DiagnosticIDs.h
include/clang/Basic/DiagnosticIndexKinds.td
include/clang/Driver/Job.h
include/clang/Driver/Options.td
include/clang/Frontend/CompilerInstance.h
include/clang/Frontend/FrontendOptions.h
include/clang/Index/DeclOccurrence.h
include/clang/Index/IndexDataConsumer.h
include/clang/Index/IndexDiagnostic.h
include/clang/Index/IndexingAction.h
include/clang/Index/RecordingAction.h
include/clang/Index/UnitIndexDataConsumer.h
include/clang/Index/UnitIndexingAction.h
include/clang/module.modulemap
lib/Basic/DiagnosticIDs.cpp
lib/Driver/Driver.cpp
lib/Driver/Job.cpp
lib/Driver/ToolChains/Clang.cpp
lib/Driver/ToolChains/Darwin.cpp
lib/Frontend/CompilerInstance.cpp
lib/Frontend/CompilerInvocation.cpp
lib/FrontendTool/CMakeLists.txt
lib/FrontendTool/ExecuteCompilerInvocation.cpp
lib/Index/CMakeLists.txt
lib/Index/FileIndexData.cpp
lib/Index/FileIndexData.h
lib/Index/IndexingAction.cpp
lib/Index/IndexingContext.cpp
lib/Index/IndexingContext.h
lib/Index/UnitIndexDataRecorder.cpp
lib/Index/UnitIndexDataRecorder.h
test/Index/Core/Inputs/module/ModDep.h
test/Index/Core/Inputs/module/ModSystem.h
test/Index/Core/Inputs/module/ModTop.h
test/Index/Core/Inputs/module/ModTopSub1.h
test/Index/Core/Inputs/module/ModTopSub2.h
test/Index/Core/Inputs/module/module.modulemap
test/Index/Core/Inputs/sys/system-head.h
test/Index/Core/Inputs/transitive-include.h
test/Index/Core/external-source-symbol-attr.m
test/Index/Core/index-instantiated-source.cpp
test/Index/Core/index-source.mm
test/Index/Core/index-subkinds.m
test/Index/Core/index-system.mm
test/Index/Core/index-unit.mm
test/Index/Store/assembly-invocation.c
tools/c-index-test/core_main.cpp
tools/diagtool/DiagnosticNames.cpp
tools/libclang/CXIndexDataConsumer.cpp
tools/libclang/CXIndexDataConsumer.h

Loading...