ByteSize: Document Search with Semantic Understanding

Finding a specific detail in a document can be difficult, especially with traditional search methods that require verbatim matches. ByteSize lets users search documents with related terms instead, making searches easier and more effective.

The Concept Behind ByteSize

I developed ByteSize at UMD's Bitcamp Hackathon, a great experience where I learned a lot about full-stack development and semantic search. The idea came to me while I was trying to search through a new document with Command-F, which only finds phrases exactly as they appear in the document. The process felt restrictive and inefficient. ByteSize uses semantic understanding instead, letting users search with paraphrases and related concepts, which widens the scope of a search and improves the accuracy of its results.

How ByteSize Works

ByteSize employs a semantic similarity model that compares the meaning of the search term to the content of the pages. Matching on meaning rather than exact wording enables more flexible, context-aware searches.
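To make the idea of comparing meaning concrete, here is a minimal sketch of similarity scoring. It is not ByteSize's actual model: the `CONCEPTS` table and the `embed` function are hypothetical stand-ins for a trained sentence-embedding model, written by hand so the example runs with the standard library alone. What carries over is the comparison step, where cosine similarity between two vectors yields a score.

```python
import math
from collections import Counter

# Hypothetical hand-written concept table standing in for a trained model.
# A real system would learn these relationships instead of listing them.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "cost": "price", "price": "price", "expensive": "price",
}


def embed(text: str) -> Counter:
    """Toy embedding: count the concepts (or raw words) found in the text."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return Counter(CONCEPTS.get(w, w) for w in words if w)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "automobile cost"
related = "The car has a high price."
unrelated = "The weather was sunny today."

related_score = cosine_similarity(embed(query), embed(related))
unrelated_score = cosine_similarity(embed(query), embed(unrelated))
print(related_score > unrelated_score)
```

Neither "automobile" nor "cost" appears in the related sentence, so a verbatim search would miss it, while the similarity score still ranks it above the unrelated one.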

ByteSize Demo

Our page is shown in the image above. The user enters the document text in the top-left field and the paraphrased search key in the field below it. The results, displayed on the right, show the relevant sentences along with the model's scores, which indicate how well each sentence matches the natural language search query. The code can be further explored here.
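The flow described above (document text in, search key in, scored sentences out) can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the function names `split_sentences`, `token_overlap`, and `rank_sentences` are hypothetical, the sentence splitter is deliberately naive, and `token_overlap` is a placeholder scorer based on shared words. A semantic similarity model would be passed in through the `score` parameter in its place.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def token_overlap(query: str, sentence: str) -> float:
    """Placeholder scorer (Jaccard word overlap), NOT a semantic model."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    union = q | s
    return len(q & s) / len(union) if union else 0.0


def rank_sentences(
    text: str,
    query: str,
    score: Callable[[str, str], float] = token_overlap,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every sentence against the query and return the best top_k."""
    scored = [(s, score(query, s)) for s in split_sentences(text)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


document = "Cats sleep a lot. Dogs love long walks. Birds sing at dawn."
for sentence, value in rank_sentences(document, "dogs walks", top_k=2):
    print(f"{value:.2f}  {sentence}")
```

Keeping the scorer as a parameter means the ranking and display logic stays the same if the model is swapped for a better one.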

Areas for Improvement

While ByteSize is a powerful tool, there are areas that can be refined:

  1. Relevance of Results: In some cases, the sentences proposed by the model are not highly relevant to the search key. Improving the model's accuracy in understanding context and relevance can enhance the search experience.
  2. UI Bugs: The current UI has some issues, such as rendering problems with repeated searches and difficulties handling multiple search keys. Addressing these bugs will make the tool more robust and user-friendly.

Conclusion

When I built this project, well-known LLMs like ChatGPT were not yet available, so it was a captivating introduction to the world of Natural Language Processing (NLP) and semantic search. Nowadays, it's interesting to see how companies like Glean are scaling these capabilities to the enterprise level.