
Question about Project Status and Potential Contributions #1

Open
Hannibal046 opened this issue Jan 18, 2025 · 3 comments

@Hannibal046

Hi Team,

First, I want to express my appreciation for maintaining this repository and fla. I'm finding both projects very valuable.

I have several questions about the project:

  1. Project Status

    • Is this repository actively being developed?
    • Are you accepting external contributions?
  2. Development Direction

    • If external contributions are welcome, could you share your roadmap?
    • I'd be interested in contributing based on the project's goals.
  3. Technical Architecture
    From my understanding:

    • The project uses fla for model definition
    • Training is handled by torchtitan
    • Given fla's HuggingFace compatibility, it should work with lm-eval-harness for evaluation
      Could you confirm if this understanding is correct?
  4. Future Plans

    • Are there plans to extend into post-training scenarios?
    • If so, open-instruct could be a valuable reference point.

Looking forward to your response and potentially contributing to the project.

Best regards

@yzhangcs
Member

@Hannibal046 Hi, yes, your understanding is correct on all points.
This project is actively being developed, and we're continuously adding more features to flame. For example:

  1. While torchtitan only supports 4D parallelism for Llama, we aim to provide comprehensive support for all FLA models.
  2. We're implementing support for online data tokenization with shuffling, which is currently lacking in torchtitan.

Regarding post-training, I don't have extensive experience in this field yet. However, I'd be very glad if you could contribute in this area. I also plan to add support for post-training features in the future.
In short, flame is a framework tightly integrated with fla and transformers, with the ambition to scale to much larger training runs.

@Hannibal046
Author

Hannibal046 commented Jan 18, 2025

@yzhangcs
Hi, thanks for the quick reply!

If you're planning to implement support for online data tokenization with shuffling, I'd like to share an elegant implementation from Meta Lingua for your reference. Their approach:

  1. Pre-shuffles data;
  2. Accepts JSON Lines as input and performs online tokenization and reshuffling with a buffer;
  3. Easily controls the ratio of different data sources;
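Point 3 (controlling source ratios) can be sketched as weighted sampling over per-source iterators. This is a minimal illustration of the idea, not Lingua's actual implementation; the name `mixed_stream` and its signature are hypothetical:

```python
import random

def mixed_stream(sources, weights, seed=0):
    """Yield examples from several data sources in a target ratio.

    sources: dict mapping source name -> iterable of examples
    weights: dict mapping source name -> relative sampling weight
    Exhausted sources are dropped and sampling continues over the rest.
    """
    rng = random.Random(seed)
    names = list(sources)
    iters = {n: iter(sources[n]) for n in names}
    while names:
        # Pick a source in proportion to its weight.
        n = rng.choices(names, weights=[weights[x] for x in names])[0]
        try:
            yield next(iters[n])
        except StopIteration:
            names.remove(n)
```

For reproducibility across runs, the seed (and in a distributed setting, the rank) would need to be threaded through checkpointing as well.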

I'm not sure which specific features you need to implement, but relying solely on step 2 (online tokenization and reshuffling with a buffer) might not be sufficient for large-scale training: some Hugging Face datasets are chronologically ordered, so even with a large online buffer the sampled data would still be biased toward the stream's original order.
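To make the limitation concrete, here is a sketch of the standard fixed-size shuffle buffer (as used by e.g. `datasets`' streaming shuffle). Each yielded example is drawn only from a sliding window of the most recent `buffer_size` items, which is why pre-shuffling matters for chronologically ordered data; the function name is illustrative:

```python
import random

def shuffled_stream(examples, buffer_size=4096, seed=0):
    """Approximately shuffle a stream with a fixed-size buffer.

    Fills the buffer, then repeatedly yields a random buffered element
    and replaces it with the next incoming example. If the input stream
    is chronologically ordered, each output still comes from a window of
    at most `buffer_size` recent examples, so global order bias remains.
    """
    rng = random.Random(seed)
    buffer = []
    for ex in examples:
        if len(buffer) < buffer_size:
            buffer.append(ex)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = ex
    # Flush what is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer
```

Combining a pre-shuffle pass over the raw files with this online buffer (Lingua's points 1 and 2) avoids the windowing bias while keeping tokenization fully streaming.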

I'm happy to help if you need any assistance!

@yzhangcs
Member

Thank you! I will be taking a look at it.
