Add a parser to easily create dialogues from plain text files #10

Closed
ameamenoame opened this issue Mar 2, 2021 · 19 comments · Fixed by #12

@ameamenoame
Contributor

ameamenoame commented Mar 2, 2021

Right now the demo uses JSON-like .scene files for dialogue and scripting, which would make writing new scenes cumbersome. We should build a proper parser first before tackling issue #5.

An example script could be like this:

<<set_bg hallway>>
<<show Sophia left fade 1.5>>
// Comment here
Sophia, happy: "Hello!"
<<show Dani right slide 1>>
Dani, surprised: "Hi there!"
...

With a simple line-by-line loop, we can easily parse this into .scene files for the demo's dialogue system to read.
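A minimal sketch of such a line-by-line loop, written in Python for brevity rather than GDScript. The two regular expressions and the dictionary keys are assumptions made for illustration; they are not the demo's actual .scene format.

```python
import re

# Hypothetical patterns for the two line shapes in the example script above.
COMMAND = re.compile(r"^<<(\w+)((?:\s+\S+)*)>>$")
DIALOGUE = re.compile(r'^(\w+)(?:,\s*(\w+))?:\s*"(.*)"$')


def parse_script(text):
    """Turn script text into a list of dictionaries, one per meaningful line."""
    nodes = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and // comments.
        if not line or line.startswith("//"):
            continue
        command = COMMAND.match(line)
        if command:
            nodes.append({
                "type": "command",
                "name": command.group(1),
                "arguments": command.group(2).split(),
            })
            continue
        dialogue = DIALOGUE.match(line)
        if dialogue:
            nodes.append({
                "type": "dialogue",
                "character": dialogue.group(1),
                "expression": dialogue.group(2),  # None when no expression is given
                "text": dialogue.group(3),
            })
            continue
        raise ValueError(f"Cannot parse line: {raw_line!r}")
    return nodes
```

Each resulting dictionary could then be written out as one node of a .scene file; how the nodes link to each other is left out of this sketch.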

@NathanLovato
Contributor

I considered that, but there are also great open-source text editors out there and for the visual novel course we're working on, I'd prefer using one of them.

As much as I like text languages like Ren'Py's, they're not ideal to keep track of story branches and variables.

@NathanLovato
Contributor

NathanLovato commented Mar 2, 2021

I gave it some more thought. My problem's mostly that I don't have time to work on this right now. Would you write the complete tokenizer + parser? And do you have availability to work on it these days?

If you can do that without breaking compatibility with the ScenePlayer, I'd be willing to compensate you for tackling this.

If so, you can email me at nathan [at] gdquest dot com

@ameamenoame
Contributor Author

If I'm compensated, sure. I have sent you an email. Glad to be working with you!

@NathanLovato NathanLovato reopened this Mar 2, 2021
@ameamenoame
Contributor Author

ameamenoame commented Mar 3, 2021

Here's what a script might look like. It would parse into exactly the contents of the 1.scene file (with some minor additions):

# Set the background image with the (optional) specified transition
background industrial_building with fade_in

# Create a jump point that the script can run back to
# After the jump the script will continue from the line below this declaration
mark jump_location

# Show Sophia with the 'enter' animation on the left side if they aren't visible yet
# and set the expression to 'happy'
sophia happy enter left "Hi there! My name's Sophia. How about you?"

dani neutral enter right "Hey, I'm Dani."

# Use indentation to separate the choices
choice:
  "Start over":
    # Jump forward/backward to a jump location somewhere in the script
    # We set the 'target' property with (jump_location_index - current_index)
    jump jump_location
  "Continue":
    # Continue onto the next line
    pass
  "Jump ahead":
    # Store the player's choice in a variable
    # We'll be storing this in a save file
    jumped_ahead = true

    jump next_jump_point 

sophia "Well, let's continue."

mark next_jump_point

# Use indentation for conditional blocks
if jumped_ahead:
  # The specific node for the below line in 1.scene also has the 'background' property set. 
  # I don't quite understand what you want to do with that. 
  # I think we should only allow the background to be set on a separate line, but it's your call.
  dani "Did you jump ahead?" at_background steppes

# Narrator
"And thus started Sophia and Dani's journey."

# Change scene with specified transition
next_scene with fade_out

What do you think?

I'll have some working code by Friday.

@NathanLovato
Contributor

I'm on the go so I can't go into detail right now, but I see some things you can simplify here.

First, definitely only one command per line.

No need for "with" on the background command. Think of lines as one command at the start with optional positional arguments.

For variables, I'd use set variable_name value for consistency: commands everywhere.

For blocks, it may be simpler to use no colon, and an "end" keyword at the end, both for the user and you. No need to track indents.

I'll think more about it.

But in general, try to think in terms of strict rules for how the language works.

Like most lines being in the form command argument1 argument2 ...

The exception being characters, where the say command could be implicit, and a default if the line doesn't start with a reserved keyword (command, choice...)
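A rough sketch of that rule, in Python for brevity rather than GDScript. The keyword set is an assumption collected from the commands mentioned in this thread, and `parse_line` is a hypothetical helper name.

```python
import shlex

# Hypothetical keyword set, taken from the commands used in this thread.
RESERVED_KEYWORDS = {"background", "mark", "jump", "set", "scene", "transition", "pass"}


def parse_line(line):
    """Read one line as `command argument1 argument2 ...`, defaulting to `say`."""
    # shlex.split keeps a "double-quoted string" together as a single word.
    words = shlex.split(line)
    if not words:
        return None
    if words[0] in RESERVED_KEYWORDS:
        return {"command": words[0], "arguments": words[1:]}
    # Not a reserved keyword: treat the whole line as an implicit `say`.
    return {"command": "say", "arguments": words}
```

With this shape, a narrator line that contains only a quoted string also falls through to `say`, with the text as its single argument.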

@NathanLovato
Contributor

Here's how I'd write the script you suggested

Notable differences:

  • Every command starts with a word. This means no equals sign to assign a variable, and so on.
  • You only have commands with positional arguments. No "with" to signal that you want a transition. This will simplify your work a bit.
  • No mandatory "next_scene" command to go to the next scene. When a scene ends, the Main game node should handle that.
  • However, we'll need a "scene" command to go to another scene.
  • A transition command specifically for transitions.
background industrial_building fade_in
mark jump_location

sophia happy enter left "Hi there! My name's Sophia. How about you?"
dani neutral enter right "Hey, I'm Dani."

choice:
	"Start over":
		jump jump_location
	"Continue":
		pass
	"Jump ahead":
		set jumped_ahead true
		jump next_jump_point 

sophia "Well, let's continue."
mark next_jump_point

if jumped_ahead:
	background steppes
	dani "Did you jump ahead?" 
	scene next_scene

"And thus started Sophia and Dani's journey."

transition fade_out

@NathanLovato
Contributor

As far as writing a language, there are typically two tools you need to code:

  1. The tokenizer (also called lexer). It's a script that takes the text file and breaks it down into tokens, that is, logical chunks that make up your language.
    • You can have multiple "sub-tokenizers". For example, you can have a main tokenizer that'll differentiate regular lines from a choice. It looks at where the choice starts and ends and passes those lines to a dedicated ChoiceTokenizer. It can be easier than trying to tokenize everything with one big script.
    • The tokenizer just stores things like "this is a command", "this is a string", "this is a choice".
  2. The parser takes the output from the tokenizer and makes sense of it.

Here's an example of a tokenizer for a command-line written in GDScript: https://github.com/winston-yallow/gd-tokenizer/blob/master/cmd/CommandParser.gd

You shouldn't necessarily copy it; I think you can go with something simpler. For instance, instead of objects, you may use dictionaries, having each token be like:

# For a command
{ type = "command", arguments = [argument1, argument2, argument3] }

# For a choice, the content is all the lines inside the choice block, that you'll break down recursively into other tokens
{ type = "choice", content = "..." } 

But you'll need something like that and then to translate it into an abstract syntax tree. The class that does that is called a parser. Think of it as a big dictionary that gives you a hierarchy of instructions that's a bit like the current scenes' data.

The good news here is that our ScenePlayer is already an interpreter and the data format of scenes is what you could already target with your parser.

Note that unlike in existing scene files, you don't need to use numbers for the nodes' next key in your syntax tree.

Instead, you can find a naming scheme for the keys. Ren'Py seems to try to give each key in the abstract syntax tree a name.

Finally, the parser should be decoupled from the tokenizer (make two scripts without dependencies) and work exclusively from the tokenizer's output. This makes extending, modifying, and fixing the language easier as we can edit either the tokenizer or the parser without breaking the other.
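To illustrate the separation, here is a small sketch in Python (not GDScript) where the parser only ever sees the tokenizer's output. The `node_<index>` naming scheme and the `next` key are assumptions for the example; the real ScenePlayer data format may differ.

```python
import shlex


def tokenize(text):
    """Tokenizer: text in, list of token dictionaries out. Knows nothing about the tree."""
    tokens = []
    for line in text.splitlines():
        words = shlex.split(line)
        if words:
            tokens.append({"type": "command", "arguments": words})
    return tokens


def parse(tokens):
    """Parser: works only from the token list and returns a dictionary of named nodes."""
    nodes = {}
    pending_name = None
    previous_key = None
    for index, token in enumerate(tokens):
        command, *arguments = token["arguments"]
        if command == "mark":
            # The next real node takes the mark's name as its key.
            pending_name = arguments[0]
            continue
        key = pending_name or f"node_{index}"
        pending_name = None
        nodes[key] = {"command": command, "arguments": arguments, "next": None}
        if previous_key is not None:
            nodes[previous_key]["next"] = key
        previous_key = key
    return nodes
```

Because `parse` takes a plain list of dictionaries, either function can be rewritten or tested on its own, for example by feeding `parse` a hand-written token list.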

@NathanLovato
Contributor

One last tip to tokenize lines. You'll see most tokenizers out there iterate over characters, which is one approach.

Another would be to use regular expressions, which might work for this mini-language. It depends on whether you're comfortable with regular expressions. If not, looping over characters may be easier.

Note that String methods may also help: the functions that strip whitespace and split words could come in handy.
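As an illustration of the regular-expression approach, here is a small Python sketch (the project would use GDScript's own regex and String APIs instead). The pattern and the `string`/`symbol` type names are assumptions for the example.

```python
import re

# Either a double-quoted string (backslash escapes allowed) or a run of non-space characters.
WORD = re.compile(r'"((?:\\.|[^"\\])*)"|(\S+)')


def split_line(line):
    """Split one line into string and symbol tokens."""
    tokens = []
    for match in WORD.finditer(line.strip()):
        if match.group(1) is not None:
            # Quoted text: the value excludes the quotes but keeps any backslashes as written.
            tokens.append({"type": "string", "value": match.group(1)})
        else:
            tokens.append({"type": "symbol", "value": match.group(2)})
    return tokens
```

The quoted alternative is listed first so that a string with spaces inside stays one token instead of being split word by word.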

@ameamenoame
Contributor Author

Here's the rough draft of the tokenizer I promised:
https://pastebin.com/iNDy2mtp

Right now the tokenizer can handle basic symbols, functions with arguments, and escape characters in string literals. I have not yet gotten to properly handling choice and if blocks, though. You can expect that to be done by next Monday, after which I'll start work on the actual parser.

The code above, with the updated script you provided:

background industrial_building fade_in
mark jump_location

sophia happy enter left "Hi there! My name's Sophia. How about you?"
dani neutral enter right "Hey, I'm Dani."

choice:
	"Start over":
		jump jump_location
	"Continue":
		pass
	"Jump ahead":
		set jumped_ahead true
		jump next_jump_point 

sophia "Well, let's continue."
mark next_jump_point

if jumped_ahead:
	background steppes
	dani "Did you jump ahead?" 
	scene next_scene

"And thus started Sophia and Dani's journey."

transition fade_out

... will produce this token list:

https://pastebin.com/EFmGaAMm

(the { type = " ", value = "" } tokens store the \t tab character in the type property)

The tokenize_symbol is quite bloated right now so I'll try to split it up more, specifically the last part that determines the type of symbol the function has acquired.

Looking forward to your feedback.

@NathanLovato
Contributor

Don't hesitate to directly open a pull request with your work-in-progress; it'll be easier to check and review changes.

Please use full words for function and variable names, i.e. no abbreviations like prev for previous. We favor code readability.

If you open a pull request, I can directly rename variables etc. for you.

Also, you don't need to put so many comments. In particular, avoid comments that paraphrase the code, like

# Class to store a general token
class Token:
    # The token's type
    var type: String
 
    # The token's value
    var value = ""

The comments above say exactly what the code says. I suppose they're for you. If so, that's fine, but I'd recommend working without comments, so you force your mind to read the code directly rather than reading and thinking in English with your attention on the comments. You'll improve your programming skills faster that way.

Regarding indents, I don't think they should be individual tokens. You can instead add an indent key to all tokens that stores a number, the indentation level.

It'll make it easier to detect errors: for example, after starting a block, you expect the indent level to go up by one. In that case, you could directly compare the previous and next tokens' indentation value.

It'll also help to detect the end of a block: it's when the indentation goes down by one.

That'll simplify parsing, I think.
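A possible shape for this idea, sketched in Python rather than GDScript. It assumes tab indentation and that a block opener is any line ending with a colon; both function names are hypothetical.

```python
def tokenize_with_indent(text):
    """Attach an `indent` level (number of leading tabs) to each line token."""
    tokens = []
    for line in text.splitlines():
        content = line.lstrip("\t")
        if not content.strip():
            continue  # blank lines carry no indentation information
        tokens.append({
            "type": "line",
            "value": content.strip(),
            "indent": len(line) - len(content),
        })
    return tokens


def check_indentation(tokens):
    """Return error messages for lines whose indent does not follow the rules."""
    errors = []
    for previous, current in zip(tokens, tokens[1:]):
        opens_block = previous["value"].endswith(":")
        step = current["indent"] - previous["indent"]
        if opens_block and step != 1:
            # After starting a block, the indent level must go up by exactly one.
            errors.append(f"Expected an indented block after {previous['value']!r}")
        elif not opens_block and step > 0:
            errors.append(f"Unexpected indent at {current['value']!r}")
    return errors
```

A drop in `indent` between two consecutive tokens marks the end of one or more blocks, so the parser can detect block ends with the same comparison.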

@ameamenoame
Contributor Author

ameamenoame commented Mar 8, 2021

I usually try to keep comments to a minimum; the over-explaining bug got me there. I've trimmed the comments down and changed the variable names in the new PR, #11.

What's in the PR:

  • Choice/If indented blocks are tracked with matching BeginBlock and EndBlock tokens. Creating a new block increments the current indentation level (leftmost/no indent counts as 0) that's tracked by the lexer; a block ends when the current line's indent level is lower than the tracked indentation level. It's similar to the way Python handles indent so I thought to do it that way instead of using an indent key as you asked. Is that okay with you?
  • Function tokens no longer have the argument tokens coupled with them. I think grouping the function and its arguments together is the parser's job, not the lexer's.
  • Support for comment tokens and simple boolean operator tokens (no grouping expressions with parentheses yet...).
  • Newlines are tokenized. They serve as delimiters for the parser mostly, for example, the parser will stop searching for a function's arguments when it hits the newline token.
  • Fixes for things that didn't work before. Using \ to escape in string literals didn't actually work before!
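The block tracking and newline tokens described above could look roughly like this. It is a Python sketch, not the code from the PR: it assumes tab indentation, derives blocks from indentation changes alone, and uses a simplified `Line` token where the real lexer emits finer-grained tokens.

```python
def lex_blocks(text):
    """Emit BeginBlock/EndBlock tokens from tab indentation, plus Newline delimiters."""
    tokens = []
    current_level = 0  # leftmost (no indent) counts as level 0
    for line in text.splitlines():
        content = line.lstrip("\t")
        stripped = content.strip()
        if not stripped or stripped.startswith("#"):
            continue  # ignore blank lines and comment-only lines
        level = len(line) - len(content)
        while level > current_level:
            tokens.append({"type": "BeginBlock"})
            current_level += 1
        while level < current_level:
            # The line is indented less than the tracked level: close blocks.
            tokens.append({"type": "EndBlock"})
            current_level -= 1
        tokens.append({"type": "Line", "value": stripped})
        tokens.append({"type": "Newline"})  # delimiter the parser can stop at
    # Close any blocks still open at the end of the script.
    for _ in range(current_level):
        tokens.append({"type": "EndBlock"})
    return tokens
```

Every `BeginBlock` gets a matching `EndBlock`, so a parser can recurse on `BeginBlock` and return on `EndBlock` without looking at indentation again.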

With this script:

background industrial_building fade_in
mark jump_location

sophia happy enter left "Hi there! My name's Sophia. How about you?"
dani neutral enter right "Hey, I'm Dani."

# Comment
choice:
	"Start over":
		# Some empty lines below...
		
			
		if !player_choice && another_player_choice:
			"Starting over..."
		else:
			"choice unfulfilled"
			dani surprised "Really!?"
		jump jump_location
	"Continue":
		pass
	"Jump ahead":
		set jumped_ahead true
		jump next_jump_point

sophia "Well, let's continue." # Inline comment
mark next_jump_point

if jumped_ahead:
	background steppes
	dani "Did you jump ahead?"
	scene next_scene
elif another_choice:
	dani "I was about to say something..."
	dani happy "But I forgot... Haha!"

"And thus started Sophia and Dani's journey."

transition fade_out

the current lexer code will produce this token list:
https://pastebin.com/emHpZ9bc

What do you think? If you think the lexer is ready, I will start working on the actual parser. You can expect the parser to be done by next Monday at the latest.

@NathanLovato
Contributor

Great work so far. I left you a note on the PR about a syntax change I'd like. Then you're good to move forward.

@ameamenoame
Contributor Author

Sorry for not giving any updates for the past few days. The parser is in progress, but it hasn't reached a state where I'm comfortable having it reviewed just yet. Right now the parser code is more a sketch of the overall logic than actual production code: it is extremely messy (300+ lines now, with some excessive classes) and the logic still fails in places (boolean operators aren't handled yet; choice blocks are parsed properly, but if blocks aren't).

In short, it's a bit too early for you to review right now. However, I'll start the clean-up process soon and you can expect some production-quality code tomorrow.

@NathanLovato
Contributor

NathanLovato commented Mar 12, 2021

Thanks much for the update, much appreciated!

Don't be afraid to push work-in-progress code: you can mark your PR as draft. That is, if and when you want a second pair of eyes on it.

Note I'll be away from tomorrow afternoon to Tuesday night so I may not review the code until then.

@ameamenoame
Contributor Author

ameamenoame commented Mar 13, 2021

I've made the draft PR: #12

The most notable thing in here is: Instead of having the parser directly produce a .scene file, I've decided to just have the parser focus on creating a syntax tree and then have the tree fed into a transpiler that'll actually create the .scene file. The reason is that the parser code is already complex enough as it is, trying to do both parsing through the token list and determining what to do with the resulting expressions will make for a huge amount of code in a single file.

Right now the parser can produce an almost complete syntax tree (nested choices and ifs seem to be parsed correctly too); only the AND and OR operators can't be parsed properly just yet. The transpiler can currently only handle dialogue lines and commands.

With this script:

background industrial_building fade_in
mark jump_location

sophia happy enter left "Hi there! My name's Sophia. How about you?"
dani neutral enter right "Hey, I'm Dani."

# Comment
choice:
	"Start over":
		if not player_choice:
			"Starting over..."
		else:
			"choice unfulfilled"
			dani surprised "Really!?"
		jump jump_location
	"Continue":
		pass
	"Jump ahead":
		set jumped_ahead true
		jump next_jump_point

sophia "Well, let's continue." # Inline comment
mark next_jump_point

if jumped_ahead:
	background steppes
	dani "Did you jump ahead?"
	if not_jumped_ahead:
		sophia "Guess I didn't"
		if really_jumped_ahead:
			sophia "I said I didn't!"
	scene next_scene
elif another_choice:
	dani "I was about to say something..."
	dani happy "But I forgot... Haha!"
else: 
	dani "I feel like we just experienced a major divergence in the timeline..."
	dani "Or maybe I'm just over-thinking things again... Urgh..."

"And thus started Sophia and Dani's journey."

transition fade_out

... the parser and transpiler will produce:
https://pastebin.com/0mLxh1h1

What do you think? I do have some doubts about the transpiler actually complicating things more and that the parser should just target the .scene format directly as you said.

The code is a bit messy (token types are still retrieved directly from the SceneLexer, stringly typed values, style issues, etc.) but by Monday, you can expect the parser and the transpiler, and the task of keeping track of player choices (I'll add some code in the ScenePlayer interpreter to save any declared variables to the save file) to be finished.

@NathanLovato
Contributor

NathanLovato commented Mar 13, 2021

I'll be away from this afternoon to Wednesday just so you know.

The idea is just to produce a dictionary the ScenePlayer can read.

So you don't have to produce a text file that represents a dictionary only to be loaded and converted back to a dictionary. You can directly produce the final data and remove the parts that convert everything to strings.

That should simplify the work a bit.

As to doing that transpile step... The code looks more complex than it needs to be at a glance. Considering this language's scope, perhaps you can simplify the process.

An abstract syntax tree allows you to do compilation, analysis, and optimization, but here the goal is just to output data: a big dictionary the ScenePlayer will read.

If possible, I'd directly produce the final dictionary. But without spending time playing with the code, I can't say whether it's easy to do or not, or whether it's the best solution.

I'll let you see what you can do and review the code in greater detail on Wednesday.

But at least, you don't have to produce a text file: you can directly create a dictionary. With your parser in place, we'll remove the scene files entirely from the project as we won't need them anymore.
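A sketch of what producing the data directly could mean, in Python rather than GDScript. The integer keys, the `next` field, and `-1` as an end marker are assumptions for illustration, not the ScenePlayer's documented format.

```python
def build_scene_data(commands):
    """Build the scene dictionary in memory; nothing is serialized to text."""
    marks = {}
    nodes = {}
    for command, *arguments in commands:
        if command == "mark":
            # Remember the index of the node that follows the mark.
            marks[arguments[0]] = len(nodes)
        else:
            nodes[len(nodes)] = {"command": command, "arguments": arguments}
    for index, node in nodes.items():
        if node["command"] == "jump":
            node["next"] = marks[node["arguments"][0]]
        else:
            # Hypothetical convention: -1 means the scene ends here.
            node["next"] = index + 1 if index + 1 in nodes else -1
    return nodes
```

The returned dictionary can be handed to the interpreter as is, which removes both the step that writes a text file and the step that reads it back.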

@ameamenoame
Contributor Author

I'll re-think the transpiler and try to have another go at having the parser directly produce the dictionary for the ScenePlayer. I do hope that the transpiler will not be needed.

@ameamenoame
Contributor Author

I've decided to go with a transpiler that turns the syntax tree from the parser directly to a dictionary for the ScenePlayer.

The parser's mostly finished. You can write a script like this:

mark jump_location

background industrial_building fade_in


sophia happy enter left "Hi there! My name's Sophia. How about you?"
dani neutral enter right "Hey, I'm Dani."

# Comment
choice:
	"Start over":
		"Starting over already?"

		set started_over true

		jump jump_location
	"Continue":
		sophia happy "Well, let's continue." # Inline comment
		set jumped_ahead false
	"Jump ahead":
		dani neutral "Jumping ahead?"
		sophia "Why not? Let's go!"
		set jumped_ahead true

		# Reset
		set started_over false

		# scene next_scene
		jump end_scene

"And thus started Sophia and Dani's journey."

mark end_scene

transition fade_out

and have the game actually do what the script tells it to do.

Any declared variables are stored in the save file. That's handled in the ScenePlayer. The ScenePlayer is mostly unchanged, most of the tweaks I made are so the checks won't go crazy on commands with optional parameters.

I had the most trouble with choice, if and their code blocks, specifically when they are nested. The behaviour right now for nested code blocks is that they work if you don't end a code block with a choice or if. My difficulty with this is also why the pass command still doesn't work yet and why you can't create jump points inside any code block.

I'll send you a PayPal invoice once you deem the code to be finished.

@ameamenoame
Contributor Author

> I had the most trouble with choice, if and their code blocks, specifically when they are nested. The behaviour right now for nested code blocks is that they work if you don't end a code block with a choice or if. My difficulty with this is also why the pass command still doesn't work yet and why you can't create jump points inside any code block.

I've pushed simple fixes for these today. Nested conditionals and choices now work properly. The pass command is now mostly syntactic sugar, so you can write this:

choice:
	"Choice 1":
		"Something"
	"Choice 2":
		# The block escapes properly with or without the `pass`
		# pass
		
"Continue"

...and still have it work as expected.

@NathanLovato NathanLovato linked a pull request Mar 18, 2021 that will close this issue