Predicting Query Quality for Source Code. Papers referred to are linked in the repository.
Create a hidden folder called ".intermediate" if you want extract_dataset.py to store the intermediate files.
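A minimal sketch of setting up that folder before running the extraction, assuming it is created relative to the working directory (the constant name here is illustrative, not the repo's):

```python
import os

# Ensure the hidden cache folder exists before extract_dataset.py
# writes intermediate files into it.
INTERMEDIATE_DIR = ".intermediate"
os.makedirs(INTERMEDIATE_DIR, exist_ok=True)
```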
Note that the Coherence Score is yet to be implemented. Whoosh is used instead of Lucene for indexing and searching.
- Added whoosh_src code to a class called IREngine.
- Coded basic classifiers in train.py and added hyperparameters to the classifiers.
- Tried implementing SMOTE (Synthetic Minority Oversampling Technique) to balance the dataset. Note that the test sample is untouched.
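The core of SMOTE is interpolating between a minority-class sample and one of its nearest minority neighbours. This is a dependency-free sketch of that idea (the repo more likely uses a library such as imbalanced-learn; the function and data here are illustrative), applied only to training data so the test sample stays untouched:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_oversample(X_min, n_new, k=3):
    """Minimal SMOTE sketch: interpolate between a minority sample and a
    random one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to all minority samples
        neighbours = np.argsort(d)[1:k + 1]            # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                             # random point on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority class with 4 training samples; generate 6 synthetic points.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_oversample(X_min, n_new=6)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies inside the minority class's convex hull.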
- data.csv has the 21 pre-retrieval metrics plus the y label for the two datasets.
- nltk and corpus data - used to remove stop words from the queries.
- tqdm - used to show progress (based on iterations completed).
- comment-parser - used to extract comments from source-code files.
- json - used to dump and load dictionaries from files.
- statistics - used for the mean() and pstdev() functions.
- math - used for the log() function.
- whoosh - used to index and search the source code dataset to get the y values for training the classifiers.
- CodeBlocks Source code http://sourceforge.net/projects/codeblocks/files/Sources/17.12/codeblocks-17.12-1.el7.centos.src.rpm
- 7-zip Source Code https://www.7-zip.org/a/7z1900-src.7z
- dataset_directory_list - contains the folder names of the datasets.
- file_extension_list - contains the list of file extensions.
- stops_words - set of English stopwords.
- FILE_LIST - list of paths of files.
- ERROR_LIST - subset of FILE_LIST containing files that raise an error on read().
- dataDic - dataDic = {datasetName: [list of paths of valid files from that dataset]}
  - datasetName = folder name of the dataset from dataset_directory_list
  - path = path of the file from FILE_LIST
- dataComments - dataComments = {filepath: comments}
  - filepath = file path from FILE_LIST
  - comments = list of comments extracted from that file
- metrics - metrics = {dataset: {path: {comment: [AvgIdf, ...]}}}
  - dataset = name of the dataset folder
  - path = path of the source code file
  - comment = comment in the source code file
  - The list stores values in the following order: [AvgIdf, MaxIdf, DevIdf, AvgIctf, MaxIctf, DevIctf, AvgEntropy, MedEntropy, MaxEntropy, DevEntropy]
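A sketch of the per-query aggregation implied by the metrics dict: compute IDF for each query term, then aggregate with mean/max/stdev to get AvgIdf, MaxIdf, and DevIdf. The IDF formula here is the standard log(N / n_t); the toy corpus and variable names are illustrative:

```python
import math
from statistics import mean, pstdev

# Toy corpus: filename -> whitespace-tokenizable content.
documents = {
    "a.c": "open file read buffer close file",
    "b.c": "parse arguments print usage",
    "c.c": "read buffer write buffer",
}

def idf(term):
    # n_t = number of documents containing the term (assumed > 0 here).
    n_t = sum(term in doc.split() for doc in documents.values())
    return math.log(len(documents) / n_t)

query = ["read", "buffer"]
idfs = [idf(t) for t in query]
features = [mean(idfs), max(idfs), pstdev(idfs)]  # AvgIdf, MaxIdf, DevIdf
```

The other metric families (ICTF, entropy) follow the same pattern: a per-term statistic aggregated over the query's terms.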
- Memoization - this class stores calculated intermediate values so they can be reused by other classes. The memoized values are:
- D_t[dataset][term] : Dictionary that stores the list of paths of all documents in the dataset that contain term.
- t_f[dataset][term] : Dictionary that stores the term-frequency of term in dataset.
- IDF[dataset][term] : Dictionary that stores the Inverse Document Frequency of term in dataset.
- ICTF[dataset][term]: Dictionary that stores the Inverse Collection Term Frequency of term in dataset.
- ENTPY[dataset][term] : Dictionary that stores the entropy values of term in dataset.
- scq[dataset][term] : Dictionary that stores the SCQ similarity value of term in dataset.
- W_BAR[dataset][term] : Dictionary that stores the w-average values of term in dataset.
- Var[dataset][term]: Dictionary that stores the Var Coherency value of term in dataset.
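A minimal sketch of such a memo store using nested defaultdicts; the attribute names mirror the list above, but the implementation itself is an assumption, not the repo's code:

```python
from collections import defaultdict

class Memoization:
    """Caches per-(dataset, term) intermediate values for reuse."""

    def __init__(self):
        # Each attribute maps dataset -> term -> cached value
        # (a list of paths in the case of D_t).
        self.D_t = defaultdict(dict)
        self.t_f = defaultdict(dict)
        self.IDF = defaultdict(dict)
        self.ICTF = defaultdict(dict)
        self.ENTPY = defaultdict(dict)
        self.scq = defaultdict(dict)
        self.W_BAR = defaultdict(dict)
        self.Var = defaultdict(dict)

memo = Memoization()
memo.IDF["7zip"]["buffer"] = 1.32  # cache a computed value for later reuse
```

Callers can then check `term in memo.IDF[dataset]` before recomputing an expensive statistic.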