Reason for enumerated file names instead of actual file names #745

SlavaKeshkov · 2024-06-10T14:45:50Z


Hello Microsoft team,

A general question:

Is there any particular reason for using only file enumeration when building chunks in utilities.py? For example, write_chunk refers to the file name as f"{file_number}.{i}.

Issue observed now:
Currently, in our deployment of the accelerator, file names are supplied to the LLM as File0, File1, etc. However, a file name usually contains useful information or metadata that should be part of indexing, for example the year of the document.

I am wondering whether you see any issues or risks with teams modifying the chunk to include the file name as well. There could be other considerations I'm overlooking.

Thanks in advance for your replies

dayland · 2024-06-18T14:06:03Z

Maintainer

The file names are already attributes of the index. The chunks are named by simply appending a number to the end of the original file name that was uploaded. Many other attributes are added to the search index as well; see the search index schema at https://github.com/microsoft/PubSec-Info-Assistant/blob/main/azure_search/create_vector_index.json
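
For illustration, one way to check that the file name is searchable in the index is to run a query restricted to that one field. The sketch below only builds a request body; `build_filename_query` is a hypothetical helper, and the field name `file_name` is an assumption that should be checked against the linked schema. `search`, `searchFields`, `searchMode`, `select`, and `top` are request-body parameters of the Azure AI Search "Search Documents" REST API.

```python
import json


def build_filename_query(file_name, field="file_name", top=5):
    """Build a Search Documents request body that matches only on one field.

    `field` is an assumed field name; replace it with the name used in
    your index schema.
    """
    return {
        "search": file_name,      # the text to look for
        "searchFields": field,    # restrict matching to the file-name field
        "searchMode": "all",      # require every query term to match
        "select": field,          # return only the file-name field
        "top": top,               # number of hits to return
    }


if __name__ == "__main__":
    body = build_filename_query("Dummy Example.pdf")
    print(json.dumps(body, indent=2))
```

If this query returns the expected chunks while the normal chat query does not, the file name is indexed but is being outranked by other fields.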


SlavaKeshkov · 2024-06-25T16:08:10Z

Author

Hi @dayland

Thanks for your replies!

We have done additional testing and would like to follow up on this topic. We ran a test that loaded 2 PDF files into the accelerator, alongside other PDF files already uploaded.

File 1:

Metadata:
- File name = Dummy Example.pdf
Contents:
- Title = “Text about sunflowers”
View file here

File 2:

Metadata:
- File name = Dummy Example.pdf
Contents:
- Title = “Dummy example.pdf”
View file here

Prompt used:

"Provide the information in the source data that is provided by the file containing Dummy Example.pdf filename. Provide the information as-is, and exclude other sources"

File 1:

Test results: Fail
Explanation: Search query does not retrieve the file contents. File can NOT be seen in “Supported contents”

File 2:

Test results: Success
Explanation: Search query retrieves the file contents. File can be seen in “Supported contents”

Additional root cause analysis performed:

The number of sources was increased from 20 to 50 (the maximum). In this case the dummy example file started to show up in “Supported contents”, but at the very end of the sources list.

Root cause hypothesis: when ranking chunks for a query, the file name seems to play only a minor role. The search index appears to give more weight to the file contents than to the file name.
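
To make the hypothesis concrete, here is a toy model of per-field weighted term matching. It is not the ranking that Azure AI Search actually uses; the documents, the field names (`file_name`, `title`, `content`), and the weights are all made up for illustration. It only shows how a file whose name matches the query, but whose title and contents do not, can end up below unrelated files unless the file-name field is boosted.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(doc, query, weights):
    """Sum, over fields, weight * number of tokens that are query terms."""
    terms = set(tokens(query))
    total = 0.0
    for field, weight in weights.items():
        hits = sum(1 for t in tokens(doc.get(field, "")) if t in terms)
        total += weight * hits
    return total


def rank(docs, query, weights):
    """Return document ids ordered from highest to lowest score."""
    ordered = sorted(docs, key=lambda d: score(d, query, weights), reverse=True)
    return [d["id"] for d in ordered]


# Made-up documents loosely mirroring the test above.
DOCS = [
    {"id": "file1", "file_name": "Dummy Example.pdf",
     "title": "Text about sunflowers", "content": "Sunflowers follow the sun."},
    {"id": "file2", "file_name": "Dummy Example.pdf",
     "title": "Dummy example.pdf", "content": "Dummy example.pdf placeholder text."},
    {"id": "guide", "file_name": "guide.pdf",
     "title": "PDF example guide", "content": "example pdf example pdf dummy"},
]

QUERY = "Dummy Example.pdf"
EQUAL = {"file_name": 1.0, "title": 1.0, "content": 1.0}
BOOSTED = {"file_name": 5.0, "title": 1.0, "content": 1.0}

if __name__ == "__main__":
    print(rank(DOCS, QUERY, EQUAL))    # file1 ranks last
    print(rank(DOCS, QUERY, BOOSTED))  # file1 moves above the unrelated guide
```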

Solution ideas:

- Modify the search parameters to give more weight to the file name
- Add the file name as the “main title” parameter, or prepend it to the “main title”, so it becomes part of each chunk associated with this file
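
Both ideas can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the accelerator is implemented: the function names are hypothetical, and the field names `file_name` and `title` are assumed and should be checked against the index schema. The first function returns a scoring profile in the shape that an Azure AI Search index definition accepts under `scoringProfiles` (text weights apply to searchable fields). The second prepends the file name to a chunk title unless the title already contains it.

```python
def filename_scoring_profile(name="boostFileName",
                             file_name_weight=5.0,
                             title_weight=2.0):
    """Idea 1: a scoring profile that weights matches in the file name.

    The returned dict would go into the "scoringProfiles" list of the
    index definition. The field names are assumptions.
    """
    return {
        "name": name,
        "text": {
            "weights": {
                "file_name": file_name_weight,
                "title": title_weight,
            }
        },
    }


def title_with_filename(file_name, title):
    """Idea 2: make the file name part of the title stored with each chunk."""
    title = (title or "").strip()
    if not title:
        return file_name
    # Avoid duplicating the name when the title already contains it.
    if file_name.lower() in title.lower():
        return title
    return f"{file_name}: {title}"


if __name__ == "__main__":
    print(filename_scoring_profile())
    print(title_with_filename("Dummy Example.pdf", "Text about sunflowers"))
```

The first idea leaves the stored chunks unchanged but influences only the text-relevance part of the score. The second changes what is indexed, so existing files would need to be reprocessed.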

We would appreciate your feedback on the following items:

- Is our root cause analysis correct? Can you confirm this issue?
- Knowing how the accelerator works overall, which of the solution directions would you recommend?

Thanks in advance!

1 reply

ravikhunt Jul 1, 2024

Hello @dayland
I hope you are doing well. I would like to know if there is any update on the above.
