Transform for Code Profiling #646

pankajskku · 2024-09-30T15:38:46Z

This transform extracts the base syntactic concepts from multi-language source code and represents them in a unified, language-agnostic form that can then be used for multi-language data profiling. Programming languages expose similar syntactic building blocks to represent programming intent, such as package/library imports, functions, classes, loops, conditionals, and comments, but each language expresses these concepts through its own grammar, defined by distinct keywords and syntactic forms.
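To make the idea concrete, here is a minimal sketch of mapping language-specific syntax onto one shared set of concept names. It is not the transform's actual implementation: the `PATTERNS` table, the concept names, and the `profile` function are hypothetical and exist only for illustration, and regular expressions are used as a crude stand-in for real parsing.

```python
import re

# Hypothetical per-language patterns, for illustration only. Regexes are a
# crude stand-in for real parsing and will miscount in many cases (for
# example, keywords that appear inside string literals).
PATTERNS = {
    "python": {
        "import": r"^\s*(?:import|from)\s+\S+",
        "function": r"^\s*def\s+\w+",
        "class": r"^\s*class\s+\w+",
        "loop": r"^\s*(?:for|while)\b",
        "conditional": r"^\s*if\b",
        "comment": r"#.*$",
    },
    "java": {
        "import": r"^\s*import\s+[\w.*]+\s*;",
        "function": r"^\s*(?:public|private|protected)\s+(?:static\s+)?[\w<>\[\]]+\s+\w+\s*\(",
        "class": r"\bclass\s+\w+",
        "loop": r"^\s*(?:for|while)\b",
        "conditional": r"^\s*if\b",
        "comment": r"//.*$",
    },
}


def profile(source: str, language: str) -> dict[str, int]:
    """Count occurrences of each concept, keyed by language-agnostic names."""
    patterns = PATTERNS[language]
    return {
        concept: len(re.findall(pattern, source, flags=re.MULTILINE))
        for concept, pattern in patterns.items()
    }


PY_SRC = "\n".join([
    "import os",
    "# helper",
    "def f(x):",
    "    if x:",
    "        for i in range(3):",
    "            pass",
])

JAVA_SRC = "\n".join([
    "import java.util.List;",
    "// helper",
    "public class A {",
    "    public static int f(int x) {",
    "        if (x > 0) {",
    "            for (int i = 0; i < 3; i++) {}",
    "        }",
    "        return x;",
    "    }",
    "}",
])

if __name__ == "__main__":
    # Both profiles share the same keys, so they can be compared directly.
    print(profile(PY_SRC, "python"))
    print(profile(JAVA_SRC, "java"))
```

Because both languages report under the same concept names, the resulting counts can be aggregated or compared across a mixed-language dataset without any language-specific handling downstream.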

Why are these changes needed?

Data profiling, in the context of machine learning, is the process of examining and analyzing data to create
useful statistics. These statistics are used both as an aid for better comprehension of the properties of data as
well as for a variety of downstream data processing tasks such as data valuation (assessing the value of data
relative to the business objectives at hand) and data curation (filtering and prioritizing training data based on
derived thresholds). In the Large Language Model (LLM) setting, training data is typically unstructured,
comprising natural language text, images, and code. In this work, we specifically focus on code-LLMs,
where the quality of code training data substantially affects the model accuracy of LLM-based coding tasks
such as code generation and summarization. Therefore, having the capabilities to characterize code data in
terms of programming language concepts aids in both deriving insights related to code training/evaluation
data and in the downstream curation of code training data. In this work, we address the problem of profiling
multi-lingual code datasets by extracting an extensible user-defined set of syntactic concepts
over arbitrary programming languages.

Related issue number (if any).

Signed-off-by: Pankaj Thorat <[email protected]>

pankajskku · 2024-10-11T05:25:57Z

@daw3rd Please let me know your opinion on the updated PR.

touma-I

@pankajskku Please slack me when you have a chance. Internal ID: [email protected]

Signed-off-by: Pankaj Thorat <[email protected]>

transforms/code/code_profiler/python/Dockerfile

touma-I

@pankajskku Please see the additional changes. I was puzzled as to how your UT was passing without the proper requirements.txt in the Dockerfile.

transforms/code/code_profiler/python/pyproject.toml

Co-authored-by: touma-I <[email protected]>

Signed-off-by: Pankaj Thorat <[email protected]>

This was addressed

pankajskku force-pushed the dev-pankaj branch 7 times, most recently from eea6e72 to 47b9dcd on September 30, 2024 21:05

pankajskku force-pushed the dev-pankaj branch 21 times, most recently from cc43bb7 to 6294b2d on October 2, 2024 10:21

pankajskku force-pushed the dev-pankaj branch from bd08600 to a89b361 on October 9, 2024 08:44

Synced the repo

73286ff

Signed-off-by: Pankaj Thorat <[email protected]>

pankajskku force-pushed the dev-pankaj branch 3 times, most recently from dba3567 to 627b4db on October 10, 2024 08:22

touma-I self-requested a review October 12, 2024 15:29

touma-I requested changes Oct 12, 2024

pankajskku force-pushed the dev-pankaj branch 2 times, most recently from 46d3e2d to 7e183d0 on October 15, 2024 09:09

Merge remote-tracking branch 'upstream/dev' into dev-pankaj

4f0bdd4

Signed-off-by: Pankaj Thorat <[email protected]>

pankajskku force-pushed the dev-pankaj branch from 7e183d0 to 4f0bdd4 on October 15, 2024 09:11

Merge branch 'IBM:dev' into dev-pankaj

892283b

pankajskku changed the title ~~Transform for Syntactic Construct Extractor~~ Transform for Code Profiling Oct 15, 2024

pankajskku requested a review from touma-I October 15, 2024 13:23

Renaming the transform to code-profiler from syntactic-concept-extractor

39158a5

Signed-off-by: Pankaj Thorat <[email protected]>

pankajskku force-pushed the dev-pankaj branch from 143c054 to 39158a5 on October 15, 2024 13:30

touma-I requested changes Oct 15, 2024

transforms/code/code_profiler/python/Dockerfile

touma-I requested changes Oct 15, 2024

transforms/code/code_profiler/python/pyproject.toml

transforms/code/code_profiler/python/pyproject.toml Outdated

transforms/code/code_profiler/python/pyproject.toml Outdated

pankajskku and others added 6 commits October 16, 2024 16:55

Update transforms/code/code_profiler/python/Dockerfile

8da376c

Co-authored-by: touma-I <[email protected]>

Update transforms/code/code_profiler/python/pyproject.toml

685ba02

Co-authored-by: touma-I <[email protected]>

Update transforms/code/code_profiler/python/pyproject.toml

7a5ef14

Co-authored-by: touma-I <[email protected]>

Update transforms/code/code_profiler/python/pyproject.toml

5b879b7

Co-authored-by: touma-I <[email protected]>

Merge branch 'IBM:dev' into dev-pankaj

62eba87

Merge branch 'IBM:dev' into dev-pankaj

efcf6d0

Signed-off-by: Pankaj Thorat <[email protected]>

pankajskku force-pushed the dev-pankaj branch from 11cb4d5 to efcf6d0 on October 16, 2024 12:18

touma-I self-requested a review October 16, 2024 12:20

touma-I approved these changes Oct 16, 2024

touma-I merged commit 7b0ff94 into IBM:dev Oct 16, 2024
6 checks passed
