Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is the separate .st and .json going to shoot us in the foot? #2

Open
travisgriggs opened this issue Mar 26, 2012 · 14 comments
Open

Is the separate .st and .json going to shoot us in the foot? #2

travisgriggs opened this issue Mar 26, 2012 · 14 comments

Comments

@travisgriggs
Copy link

Post excellent dinner and napkin signing experience, I've been pondering in the back of my mind. One of the reservations I have in retrospect is the .json file per method. When I did the first STIG, it was an attempt to see if "file per structural entity" was workable at all. As I've stated, I was surprised about how un-dismal things were. But I still think there's a balance. The controlling factor in lots of little files, is the minimum disk block size. For these file-per-method schemes, method files dominate the storage scheme. 99% of the files will be method related, with a small fraction being used to define classes and other structuring elements. And almost all of the methods will be under the minimum disk block size. This means that disk usage will usually be roughly nummethods+minimumblocksize + "some change". When we do a .json file per method, you're basically instantly doubling the amount of disk space and the amount of files that GIT has to manage. I don't have a counter proposal. What I did for STIG, doesn't really meet what the basic goals of Cypress are. But I do question whether we're doing the right thing.

@dalehenrich
Copy link
Member

Excellent point ... for the time being I'm going to vote for "making it work" over "making it small" ... however ....

I originally went with the chunk format because it allowed for "properties" while still looking like smalltalk (kinda ugly though) ... perhaps we can return to the chunk format, only with custom method handling so instead of this:

!MCFileTreeAbstractReader methodsFor: 'utilities'!
addClassAndMethodDefinitionsFromDirectory: aDirectory
    self subclassResponsibility! !

we do something like this:

!CypressMethodLoader loadMethodsForClassNamed: 'MCFileTreeAbstractReader' inProtocol: #('utilities') !
addClassAndMethodDefinitionsFromDirectory: aDirectory
    self subclassResponsibility! !

We can add some more args if we need them for additional method properties and we're back to a single .st file per method...it still formats as smalltalk and is even readable!

@dalehenrich
Copy link
Member

Hmmmm, suppose we should avoid executable code...

@mkobetic
Copy link
Member

The plain format of the method.st file has its appeal in being so simple. Another aspect that we viewed favourably when we talked about this was that simple protocol move will be easy to distinguish from an actual method change. So I'd be inclined to stick with the plan for now and see. But if measurements or experience shows that doubling the files per method is an issue, I'd certainly reconsider.

@dalehenrich
Copy link
Member

Agreed that the plain format is simple and easy on the eyes when reading the source code for a method, but there are several other factors to consider:

  • Last night as I read code in the STIG github repository, several times I found myself switching to the json file for the method to see what protocol the method was in ... I'm used to seeing that information and at the time I was trying to grab some methods and suck them into Amber, so the protol mattered.
  • The other thing I ran into is that I kept clicking on the wrong method ... it is quite intimidating to see this big list of files and parsing the extension to figure out which one I want to look at ...
  • github looks at file extensions to figure out what language to categorize the project with (note that the amber-cypress project is a Smalltalk project, while the STIG project is currently uncharacterized). In earlier versions of the amber-cypress project, the project was categorized as Javascript by github, because there were more .js files that .st files in the project. I reduced the number of .js files used and voila, it became a Smalltalk project ... It would be a shame if folks who used the standard smalltalk package format had all of their projects characterized as something other than Smalltalk ... I actually don't know how github treats .json extenstions, so perhaps github is being confused by the directories with extensions...Sometimes it just takes a long time. So at one level this is a trivial matter, but on another level, this is maybe the most important.
  • it's true that protocol changes will lead to "spurious" changes to the .st file, but something has to change and merging a single file that has both protocol and content changes is actually easier than dealing with twice as many files...

As I've written this up, it seems to me that we could do something similar to what Travis originally did, using a modified chunk format where we have multiple leading protocol names delimited by !'s:

!testing!
isOkay
  ^#'???"
!testing!
!querying!
isOkay
  ^#'???"

We could even do something like the following where we just tokenize the string between the !'s:

!category: 'testing'!
!timestamp: 'dkh 1/1/2012 08:45:13'!
isOkay
  ^#'???"

I actually don't think that the above is that egregious, especially since the usual case has a single line:

!category: 'testing'!
isOkay
  ^#'???"

To be honest the github language assignment issue actually bugs me more than anything else:)

@mkobetic
Copy link
Member

I don't really have a strong opinion either way at this point. I would like to focus on finishing a working prototype first, but I should be able to handle few more format changes along the way. I wonder what is required of the .st file for github to recognize it as such. Since we don't actually file-in the sources, we don't have to stick to a chunk-like format. For all intents and purposes we could just concatenate the json file with the st file. The advantage there would be that we don't need yet another "property parser". But what you propose should be fairly easy to parse as well, so not that big a deal, I guess I'd pick what's easier on the eyes.

@dalehenrich
Copy link
Member

I had separate files for method and category/author at one point in FileTree and almost immediately moved to a single file because of the visual noise...

.st file extension is all that github looks at ... they don't bother to open the file and peek at the contents ... so they aren't necessarily accurate:)

If the:

!category: 'testing'!
isOkay
  ^#'???"

is easy enough to parse then let's go with it ...if we start stacking up more properties for methods, then we'll have to go to plan B ... BTW, I figure that the method properties could be optional ... if there are more than a category, then the method.json file will have the additional properties (for now no optional property file) ...

I think with this change the format will be fixed for the initial release and we can work on implementations ...

@mkobetic
Copy link
Member

"Dale Henrichs"[email protected] wrote:

is easy enough to parse then let's go with it ...if we start stacking up more properties for methods, then we'll have to go to plan B ... BTW, I figure that the method properties could be optional ... if there are more than a category, then the method.json file will have the additional properties (for now no optional property file) ...

In that case I'd be inclined to do what Travis did originally. First line is the protocol, the rest is the method. No bangs, no pretending that this is a chunk format. Would you agree with that ?

@dalehenrich
Copy link
Member

Yes, let's take that for a spin ...

----- Original Message -----
| From: "Martin Kobetic" [email protected]
| To: "Dale Henrichs" [email protected]
| Sent: Friday, March 30, 2012 1:09:06 PM
| Subject: Re: [Cypress] Is the separate .st and .json going to shoot us in the foot? (#2)
|
| "Dale Henrichs"[email protected] wrote:
| > is easy enough to parse then let's go with it ...if we start
| > stacking up more properties for methods, then we'll have to go to
| > plan B ... BTW, I figure that the method properties could be
| > optional ... if there are more than a category, then the
| > method.json file will have the additional properties (for now no
| > optional property file) ...
|
| In that case I'd be inclined to do what Travis did originally. First
| line is the protocol, the rest is the method. No bangs, no
| pretending that this is a chunk format. Would you agree with that ?
|
| ---
| Reply to this email directly or view it on GitHub:
| #2 (comment)
|

@travisgriggs
Copy link
Author

It's all a balance. I think it's important to maintain pragmatism in the whole thing. For example, we worry a little bit about a Smalltalk that does multiple categories per method. But the reality is... only Dolphin does that, and it's not busting down the doors of a comeback at the moment. Dale's approach is interesting to me, because it gives a mechanism to extend other metadata into the method, and should relatively quick. One could make it terser by saying metadata is always reduced to a single line (so the terminator is a cr), and starts with a tab. The tab is interesting, because visually it pulls the metadata separate, makes it a sort of header, and avoids clutter. And methods don't start with tabs usually. I'm fine saying that in fact, they shouldn't for this to work.

A total different way to do this would be to use directories. In some ways, I think this is very pragmatic. It avoids this whole overloading of a file-per-method. One of schemes I did was this way. I drew away from it for two reasons. One, I was worried about spurious category changes making noise. Upon reflection, I just don't think this is a strong case. I've watched hundreds of deltas with the new VW comparison tool, and it marks category changes distinct from other cases. And the heuristic fact, is that category changes are a a small fraction of the changes, especially compared to real content changes. The second reason, was that I was worried that GIT wouldn't notice a directory move as just that, but would instead see it as a remove/new file. I was actually surprised here how good GIT was at noting that I had simply moved a file to another directory.

One things you lose with encoding categories as directories is the ability to do mutlicategories. My response is "meh, so what." It's an interesting idea that's been kicked around for years, and never jumped out strong enough to actually happen. The other is the ability to see all of the methods at once when browsing. We partially lost this already when we split class/inst anyway. And again, it's a balance. There's no way to design a scheme that is awesomely efficient, extensible, and browsable all at the same time.

Waxing philosophical, working on this, helped me realize why Smalltalk will never become a "file based" thing. The whole idea of "everything is an object" and "everything gets done with lots of little methods" just scales beyond the abilities of file manager tools to navigate and deal with that number of files.

@dalehenrich
Copy link
Member

Agree with your conclusion: "Smalltalk will never become a file based thing". The tricks that we are playing with the directory structure and files is a balancing act between maintaining some level of readability and preserving the required object-ness.

The fact that over time we will support multiple formats will allow folks to make these tradeoffs on a project by project basis. The guys that want to edit Smalltalk code in Emacs will be able to pick a format that is friendlier to the text editor (single file), but they will give up method history .... and so on.

Github is an important component here ... with github, I can reference the source for a particular method and line line number range in an email message and that's something that we Smalltalkers have never really been able to do and I think it can really help our development going forward ... To leverage this capability we need to be able to navigate the directory structure a bit ...

Along these lines I've talked to Johnny about using Amber to browse the source code on github. I've already got Amber hosted from Github and Amber is using the GitHub web api to extract information about the project, so it isn't a big extension to browsing source with a real code browser ...

@bnemec
Copy link
Member

bnemec commented Apr 1, 2012

VA allows for multiple categories per method, it supports flagging of methods are private or public, as well as a method comment & method note which are accessed by their own tab in the browser.

If the core requirement is to support cross dialect code, which requires a super set of meta data, what would be the down side of using XML? That would embed the extra data that we need for each method within the method file without requiring a novel file format. Would working with XML as the method file format be a problem? I'm thinking of tools like code compare: if the tags are consistent, the code delta should display nicely.

If we only support a subset of method metadata, then I can import with default options: public, no method comment or notes, and only one category. So far everything looks straight forward for a VA implementation.

I'm new to Git, so it would not surprise me that I may be missing something obvious.

@dalehenrich
Copy link
Member

Bob,

I went with JSON, because I had a simple JSON parser and I think that JSON is more readable than XML ...

I think we'd like to avoid too much structure in the method source files ... which is why we're toying with chunk format and simple formats for that file...

Right now my inclination would be to use the separate .json file a method when it is in more than one protocol and go with Travis' simple format when there is only one....

the separate .json file would be the perfect place for you to store the full range of method meta data for VA methods...

@bnemec
Copy link
Member

bnemec commented Apr 1, 2012

One file per method + json for extra meta data would work well. For VA, that would be public methods in one category with no method commends or notes; a majority of methods.

@mkobetic
Copy link
Member

mkobetic commented Apr 5, 2012

The latest STIG version merges the categories into the .st file as we discussed. Currently that means that STIG will never generate a method level json file and will ignore it if there is one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants