Predicting Query Quality for Source Code Dataset

The papers referred to are linked in the repository.

Create a hidden folder called ".intermediate" if you want to store the intermediate files.

Note that the Coherence Score is still to be implemented, and that whoosh is used instead of Lucene for indexing and searching.

Latest Commit Changes

  • added whoosh_src code to a class called IREngine
  • coded basic classifiers in
  • added hyperparameters to classifiers.
  • tried implementing SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset. Note that the test sample is left untouched.
  • data.csv has 21 pre-retrieval metrics + y label for the two datasets.
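
The SMOTE idea mentioned above can be illustrated with a small standard-library sketch. This is not the repository's implementation (which may well rely on a library); the function name `smote_like_oversample` and the toy data are assumptions made for illustration. It creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, and it is applied to the training split only, so the test sample stays untouched.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy training split (hypothetical values): 3 minority vs 6 majority samples.
train_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
train_majority = [[5.0, 5.0] for _ in range(6)]

n_new = len(train_majority) - len(train_minority)
balanced_minority = train_minority + smote_like_oversample(train_minority, n_new)
```

Because every synthetic point is an interpolation between two existing minority points, it stays inside the region already covered by the minority class.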

Requirements so far

  1. nltk and corpus data: used so far to remove stop-words from the queries.
  2. tqdm: used to show progress (based on iterations completed).
  3. comment-parser: used to extract comments from source-code files.
  4. json: used to dump and load dictionaries from files.
  5. statistics: used for the mean() and pstdev() functions.
  6. math: used for the log() function.
  7. whoosh: used to index and search the source code dataset to get the y values for training the classifiers.
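
As an illustration of how json can be used to dump and load dictionaries, here is a minimal sketch that caches a dictionary in the ".intermediate" folder. The helper names `dump_intermediate` and `load_intermediate` and the one-file-per-dictionary layout are assumptions, not the repository's actual code.

```python
import json
from pathlib import Path

# Hidden folder named in this README; the file layout inside it is assumed.
INTERMEDIATE_DIR = Path(".intermediate")


def dump_intermediate(name, data, directory=INTERMEDIATE_DIR):
    """Write a dictionary to <directory>/<name>.json and return the path."""
    directory.mkdir(exist_ok=True)
    path = directory / f"{name}.json"
    with path.open("w", encoding="utf-8") as fh:
        json.dump(data, fh)
    return path


def load_intermediate(name, directory=INTERMEDIATE_DIR):
    """Return the stored dictionary, or None if it has not been dumped yet."""
    path = directory / f"{name}.json"
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

A caller could try `load_intermediate("dataComments")` first and recompute the dictionary only when the result is `None`.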

Datasets used

  1. CodeBlocks Source code
  2. 7-zip Source Code

Description of code so far


  1. dataset_directory_list - contains the folder names of the datasets.
  2. file_extension_list - contains the list of file extensions.
  3. stops_words - set of English stopwords.
  4. FILE_LIST - list of file paths.
  5. ERROR_LIST - subset of FILE_LIST containing the files that cause an error on read().
  6. dataDic - dataDic = {datasetName: [list of paths of valid files from that dataset]}, where datasetName is the folder name of the dataset from dataset_directory_list and each path is a file path from FILE_LIST.
  7. dataComments - dataComments = {filepath: comments}, where filepath is a file path from FILE_LIST and comments is the list of comments in that file.
  8. metrics - metrics = {dataset: {path: {comment: [AvgIdf, ...]}}}, where dataset is the name of the dataset folder, path is the path of the source code file, and comment is a comment in that file. The list stores values in the following order: [AvgIdf, MaxIdf, DevIdf, AvgIctf, MaxIctf, DevIctf, AvgEntropy, MedEntropy, MaxEntropy, DevEntropy].
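
To make the shape of the metrics dictionary concrete, the sketch below computes the first three values (AvgIdf, MaxIdf, DevIdf) for one comment over a toy corpus, using math.log(), mean() and pstdev(). The IDF definition log(N / df), the zero value for unseen terms, and the names `idf`, `idf_metrics`, "toy_dataset" and "src/io.c" are assumptions for illustration; the repository may define these differently.

```python
import math
from statistics import mean, pstdev


def idf(term, documents):
    """IDF as log(N / df); 0.0 for terms absent from the corpus (assumed)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


def idf_metrics(query_terms, documents):
    """Return [AvgIdf, MaxIdf, DevIdf] for the given query terms."""
    values = [idf(t, documents) for t in query_terms]
    return [mean(values), max(values), pstdev(values)]


# Toy corpus: each document is the set of terms it contains.
documents = [
    {"open", "file", "buffer"},
    {"close", "file"},
    {"parse", "buffer", "token"},
    {"file", "token"},
]

# The comment is the query; stop-words are assumed to be removed already.
avg_idf, max_idf, dev_idf = idf_metrics(["file", "parse"], documents)

# Same nesting as the metrics dictionary: dataset -> path -> comment -> list.
metrics = {
    "toy_dataset": {
        "src/io.c": {"parse the file": [avg_idf, max_idf, dev_idf]},
    },
}
```

The remaining entries of the list (the Ictf and Entropy statistics) would be appended in the same way once their per-term values are available.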


  1. Memoization - this class caches calculated intermediate values so that other classes can reuse them. The memoized values are:
    • D_t[dataset][term] : Dictionary that stores the list of paths of all documents in the dataset that contain term.
    • t_f[dataset][term] : Dictionary that stores the term frequency of term in dataset.
    • IDF[dataset][term] : Dictionary that stores the Inverse Document Frequency of term in dataset.
    • ICTF[dataset][term] : Dictionary that stores the Inverse Collection Term Frequency of term in dataset.
    • ENTPY[dataset][term] : Dictionary that stores the entropy value of term in dataset.
    • scq[dataset][term] : Dictionary that stores the SCQ similarity value of term in dataset.
    • W_BAR[dataset][term] : Dictionary that stores the w-average value of term in dataset.
    • Var[dataset][term] : Dictionary that stores the Var Coherency value of term in dataset.
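
A minimal sketch of how such a memory class could work, covering only D_t, t_f and IDF. The method names, the corpus format ({path: list of tokens}) and the IDF formula log(N / |D_t|) are assumptions; the actual class in the repository may be organised differently.

```python
import math
from collections import defaultdict


class Memoization:
    """Caches per-dataset, per-term values so they are computed only once."""

    def __init__(self):
        self.D_t = defaultdict(dict)  # dataset -> term -> list of paths
        self.t_f = defaultdict(dict)  # dataset -> term -> collection frequency
        self.IDF = defaultdict(dict)  # dataset -> term -> IDF value

    def docs_with_term(self, dataset, term, corpus):
        """Paths of all documents in corpus ({path: tokens}) containing term."""
        cache = self.D_t[dataset]
        if term not in cache:
            cache[term] = [p for p, tokens in corpus.items() if term in tokens]
        return cache[term]

    def term_frequency(self, dataset, term, corpus):
        """Total number of occurrences of term across the dataset."""
        cache = self.t_f[dataset]
        if term not in cache:
            cache[term] = sum(tokens.count(term) for tokens in corpus.values())
        return cache[term]

    def idf(self, dataset, term, corpus):
        """IDF as log(N / |D_t|), reusing the cached D_t entry (assumed formula)."""
        cache = self.IDF[dataset]
        if term not in cache:
            n_t = len(self.docs_with_term(dataset, term, corpus))
            cache[term] = math.log(len(corpus) / n_t) if n_t else 0.0
        return cache[term]


# Toy corpus with hypothetical file names.
corpus = {
    "a.c": ["open", "file", "file"],
    "b.c": ["close", "file"],
    "c.c": ["parse"],
}
memo = Memoization()
file_idf = memo.idf("toy", "file", corpus)
file_tf = memo.term_frequency("toy", "file", corpus)
```

The other memoized values (ICTF, ENTPY, scq, W_BAR, Var) could follow the same check-then-store pattern, each reusing the entries already cached.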