DocumentSearch

The goal of this exercise is to create a working program to search a set of documents for the given search term or phrase (single token), and return results in order of relevance. Relevancy is defined as number of times the exact term or phrase appears in the document. Create three methods for searching the documents:

• Simple string matching
• Text search using regular expressions
• Preprocess the content and then search the index

Prompt the user to enter a search term and search method, execute the search, and return results.
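As a rough illustration of the first two methods, the sketch below counts how often a term occurs in a document's text, once with plain string matching and once with a regular expression. The function names and the choice to count non-overlapping, case-sensitive matches are assumptions made for this example and are not taken from the project's source.

```kotlin
// Counts non-overlapping occurrences of `term` in `text` using String.indexOf.
// (Hypothetical helper; matching is case-sensitive unless ignoreCase is set.)
fun countByStringMatch(text: String, term: String, ignoreCase: Boolean = false): Int {
    if (term.isEmpty()) return 0
    var count = 0
    var index = text.indexOf(term, startIndex = 0, ignoreCase = ignoreCase)
    while (index >= 0) {
        count++
        // Resume after the end of the current match so matches do not overlap.
        index = text.indexOf(term, startIndex = index + term.length, ignoreCase = ignoreCase)
    }
    return count
}

// Counts the same occurrences with a regular expression.
// Regex.escape makes the term match literally, so "a.b" does not match "axb".
fun countByRegex(text: String, term: String, ignoreCase: Boolean = false): Int {
    if (term.isEmpty()) return 0
    val options: Set<RegexOption> =
        if (ignoreCase) setOf(RegexOption.IGNORE_CASE) else emptySet()
    return Regex(Regex.escape(term), options).findAll(text).count()
}
```

Both functions should return the same count for the same input, which makes it easy to check one method against the other.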

Three files have been provided for you to read and use as sample search content. Run a performance test that does 2M searches with random search terms, and measures execution time. Which approach is fastest? Why? Provide some thoughts on what you would do on the software or hardware side to make this program scale to handle massive content and/or very large request volume (5000 requests/second or more).  
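One way such a performance test could be structured is sketched below. The `timeSearches` helper, its parameters, and the fixed random seed are hypothetical; the search method under test is passed in as a function so the same harness can time all three approaches.

```kotlin
import java.util.Random
import kotlin.system.measureTimeMillis

// Runs `iterations` searches with terms picked at random from `terms` and
// returns the elapsed wall-clock time in milliseconds.
// (Hypothetical harness; `search` is whichever search method is being timed.)
fun timeSearches(
    terms: List<String>,
    iterations: Int,
    random: Random = Random(42L), // fixed seed so runs are repeatable
    search: (String) -> Int
): Long {
    require(terms.isNotEmpty()) { "need at least one candidate term" }
    var sink = 0L // accumulate results so the search work is actually used
    val elapsed = measureTimeMillis {
        repeat(iterations) {
            sink += search(terms[random.nextInt(terms.size)])
        }
    }
    check(sink >= 0L) // hit counts are never negative
    return elapsed
}
```

Calling `timeSearches(terms, 2_000_000) { term -> ... }` once per search method gives three timings to compare. JVM warm-up can distort the first measurement, so repeating each run a few times is advisable.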

This application is built and runs in:

IntelliJ IDEA 2018.1 (Community Edition)
Build #IC-181.4203.550, built on March 26, 2018
JRE: 1.8.0_152-release-1136-b20 amd64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Windows 10 10.0

You'll also need Kotlin and JUnit installed. Kotlin is bundled with IntelliJ IDEA starting from version 15.

To test/run:

Clone repository
Import project into IntelliJ
Run main function in Main.kt
Enter search term in console window
Enter search method (1 for string match, 2 for regex, or 3 for indexed) in console window
Observe output in console window

I chose to write this application in Kotlin because I have recently been teaching myself the language, and this seemed like a good way to reinforce that knowledge. I also use Kotlin professionally for Android development and in my personal projects. Writing the application in Kotlin lets me demonstrate that knowledge while benefiting from the language's features. I use its object-oriented and functional features where they are useful, namely abstract classes, inheritance, and lambdas. I use a priority queue to keep output sorted by relevance at all times.
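The priority queue idea can be sketched as follows. The `SearchResult` type, the `rankByRelevance` function, and the tie-break by document name are assumptions for illustration and may differ from what the project does.

```kotlin
import java.util.PriorityQueue

// Hypothetical result type: a document name and how often the term occurred in it.
data class SearchResult(val document: String, val hits: Int)

// Returns results from most to least relevant, breaking ties by document name.
fun rankByRelevance(results: List<SearchResult>): List<SearchResult> {
    val queue = PriorityQueue<SearchResult>(
        compareByDescending<SearchResult> { it.hits }.thenBy { it.document }
    )
    queue.addAll(results)
    // Polling yields elements in comparator order; iterating the queue would not.
    return generateSequence { queue.poll() }.toList()
}
```

Because the queue orders elements by its comparator, results can be added in any order as each document is searched and still come out with the highest hit count first.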

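Indexed search is the fastest because putting and getting elements in a hash map typically has O(1) complexity. The content is preprocessed into the index once, so each search becomes a hash map lookup instead of a scan of the documents.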
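A minimal sketch of such an index is shown below. The function names, the nested map layout, and the tokenization rule (case-sensitive, splitting on non-alphanumeric characters) are assumptions for this example, not a description of the project's actual index.

```kotlin
// Builds term -> (document name -> occurrence count) in one pass over the content.
// Tokenization here is an assumption: case-sensitive, split on non-alphanumerics.
fun buildIndex(documents: Map<String, String>): Map<String, Map<String, Int>> {
    val index = HashMap<String, MutableMap<String, Int>>()
    val separator = Regex("[^A-Za-z0-9]+")
    for ((name, text) in documents) {
        for (token in text.split(separator)) {
            if (token.isEmpty()) continue
            val counts = index.getOrPut(token) { HashMap() }
            counts[name] = (counts[name] ?: 0) + 1
        }
    }
    return index
}

// A query is one hash lookup, then a sort of that term's per-document counts.
fun searchIndex(index: Map<String, Map<String, Int>>, term: String): List<Pair<String, Int>> =
    index[term].orEmpty().toList().sortedByDescending { it.second }
```

The cost of scanning the text is paid once in `buildIndex`. After that, the work per query depends on the number of documents containing the term rather than on the total size of the content.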

