Logo

0x5a.live

for different kinds of informations and explorations.

GitHub - ParserKt/ParserKt: Naive one-pass recursive descent, scannerless parser framework for Kotlin

Naive one-pass recursive descent, scannerless parser framework for Kotlin - ParserKt/ParserKt

Visit SiteGitHub - ParserKt/ParserKt: Naive one-pass recursive descent, scannerless parser framework for Kotlin

GitHub - ParserKt/ParserKt: Naive one-pass recursive descent, scannerless parser framework for Kotlin

Naive one-pass recursive descent, scannerless parser framework for Kotlin - ParserKt/ParserKt

Powered by 0x5a.live ๐Ÿ’—

ParserKt

Introduction

ParserKt is a naive one-pass recursive descent, scannerless parser framework for Kotlin (mainly JVM, compatible for JS)

A parser is a kind of program extracting data from input sequences:

NOTE: Using REPL from command line is not a good practice, you can create a project with dependency on JitPack or use IDEA Kotlin REPL instead.

git clone https://github.com/ParserKt/ParserKt.git && cd ParserKt
gradle shadowJar
kotlinc -cp build/libs/ParserKt-*.jar

Don't be scared! ParserKt is just a small library, even full-build won't take too much time. (or you can download pre-built jars here)

// Pattern.read(String), CharInput.STDIN, ...
// item(x), elementIn('a'..'z'), satisfy<Int> { it % 2 == 0 }, ...
// asList(), asString(), ...
import org.parserkt.*
import org.parserkt.util.*
import org.parserkt.pat.*

Import parserkt, util, pat, then we can implement our own combined pattern! (for a more in depth sample, see link at header)

val digits = Repeat(asList(), elementIn('0'..'9')) //{[0-9]}
digits.read("123") //[1, 2, 3]
digits.read("12a3") //[1, 2]
digits.read("a3") //null //(= notParsed)

A parser mainly recognize, and transform inputs into a simpler form (like AST) for post-processing (preetify, evaluate, type-check, ...), and it can be used to create (DSL) language tools / compilers / transpilers / interpreters.

// Generic parsing for [1, a, b] where (b > a)
val ints = Seq(::IntTuple, item(1), *Contextual(item<Int>()) { i -> satisfy<Int> { it > i } }.flatten().items() )
ints.rebuild(1,2,3) //[1, 2, 3]
ints.rebuild(1,2,1) //notParsed
i.rebuild(1,0,1) //[1, 0, 1]
// Using Convert patterns
fun digitItem(digit: SatisfyPattern<Char>) = Convert(digit, {it-'0'}, {'0'+it})
val digit = digitItem(elementIn('0'..'9'))
val digitsInt = Repeat(JoinFold(0) { this*10 + it }, digit)
digitsInt.read("233") //233

(click here if you want to try out more)

ParserKt divides complex grammar into simple subpart val definitions. "entry rule" could be all public structure โ€” like item, elementIn, Repeat, and thus it's very easy to debug, and to reuse implemented syntax.

What means "one-pass"?

See FeedModel.kt:

interface Feed<out T> {
  val peek: T; fun consume(): T
  class End: NoSuchElementException("no more")
}

That's all one subparser can see, no mark/reset, not even one character supported for lookahead.

What means "recursive descent"?

Take a look of these expression: a + b, a + (b + c)

sealed class PlusAst {
  data class Plus(val left: PlusAst, val right: PlusAst): PlusAst()
  data class IntLiteral(val n: Int): PlusAst()
}

(sealed class are just classes with determinable subclass, in their inner name scope)

This data structure implies, every symbol of a + b could be an integer (like 0, 9, 16, ...IntLiteral(n)), or other a + b (1 + 2, 9 + 0, ...Plus(left, right))

PlusAst is recursive, so it's best to implement it's parser in recurse function form, that's "recursive".

Plus := (Plus '+' Plus) | Int  ; left associative
Int := {[0-9]}

({a} denotes repeat, a|b denotes alternative)

When reading input "1+2+3" by rule Plus, results Plus(Plus(Int(1), Int(2)), Int(3)).

Parser keep going to the state of reading substructure Int from reading Plus.

From rule Plus to its subrule Int, that's "descent".

โ€”โ€”

ParserKt cannot make any lookaheads, so it's looks like impossible to parse syntax like //, /*, or 0x 0b (and error when 0(octal)).

In fact, "lookahead" can be stored in call stack of Pattern.read by pattern Contextual, so write parser for such syntax is possible (but much more complicated, so it's better to create new Pattern subclass, or fall back to tokenizer-parser LexerFeed, or use TriePattern)

What is the difference between "parser combinator" and "parser compiler"?

Let me clarify first: I really don't know what "combinator" is :(

I think combinators must be related to "functional programming", or at least "declarative programming" โ€” Define what it is, not how computers actually solve it.

Parser combinator is mainly about code reuse, parser compiler is mainly about pattern matching algorithms.

Parser compilers and parser combinators solves the same problem in different ways. A parser compiler have better portability, and better performance (I'm not sure). A parser combinator integrates better with the host language, and it's really easy to write/debug, since it's just hand-wrote program.

For example, keywords about parser compilers: LL(k), LR(k), LALR, NFA, DFA, KMP, Ahoโ€“Corasick

:fearful: Scared? Recursive descent method are the most popular practice of handwriting parser used by many well-known projects like Lua, and it's also the best choice for code quality โ€” source files generated by parser compilers offen a mess!

Features

  • Minimal boilerplate code & high reusability
  • DSL style & fully object-oriented library design
  • Rebuild parse result back to input (REPL friendly)
  • Generic parsing input (Char, String, XXXToken, ...)
  • One-pass input stream design without hasNext
  • ReaderFeed for implementating interactive shells (JVM only)
  • The framework manages parsed data in a flexible and extensible way (Tuple2, Sized-Tuple, Fold)
  • Supports complex contextual syntax structure that cannot be supported directly in parser compilers (LayoutPattern, NumUnitPattern, ...)
  • Extensible error messages using clam {"message"}, clamWhile(pat, defaultValue) {"message"}
  • Parser input stream Feed can have state storage: withState(T), stateAs<T>()
  • <500K compiled JVM library, no extra runtime needed (except kotlin-stdlib, Kotlin sequences are optional)
  • No magics, all code has been rewritten at least 9 times by the author, ugly/ambiguous/disordered codes are removed

Great thanks to Kotlin, for its strong expressibility and amazing type inference.

Provided combinators

Modules

  • :parserkt for Feed/Input model and basic (org.parserkt.pat) / complex (org.parserkt.pat.complex) patterns
  • :parserkt-util Fundamental data types (Slice, Tuple, Trie, ...) and helper class (Preety, AnyBy, ...) used/prepared for ParserKt and friends
  • :parserkt-ext Extension library for real-world parsers, eliminating boilerplate code (LayoutPattern, LexicalBasics, ...)

abbreviations

  • POPCorn (Pattern, OptionalPattern, PatternWrapper, ConstantPattern)
  • SURDIES (Seq, Until, Repeat, Decide, item, elementIn, satisfy)
  • CCDP (Convert, Contextual, Deferred, Piped)
  • SJIT (SurroundBy, JoinBy, InfixPattern, TriePattern)

More runnable REPL snippets

Note that for frequently-used pattern combinations, we have org.parserkt.pat.ext:

// Using pat.ext and LexicalBasics
import org.parserkt.pat.ext.*
val digitsInt = Repeat(asInt(), LexicalBasics.digitFor('0'..'9'))
digitsInt.read("180") //180
// Implementing complex pattern
import org.parserkt.pat.complex.*
// Patterns with constant values are OK to perform rebuild, even ignored in parse result
val letter = elementIn('A'..'Z', 'a'..'z', '0'..'9') or item('_')
val sep = elementIn(',',':').toConstant(',') named "sep"
val xsv = JoinBy(sep, Repeat(asString(), letter)) // try replace letter using !sep
xsv.read("monkey,banana,melon") //Tuple2(first=[monkey, banana, melon], second=[,, ,])

xsv.concatCharJoin().read("็Œด:้›€:็“œ") //Tuple2(first=[็Œด, ้›€, ็“œ], second=::)
val goodXsv = xsv.mergeConstantJoin()
goodXsv.read("she,her,herself") //[she, her, herself]
goodXsv.rebuild("she,her,herself") //she,her,herself

ParserKt provides mergeFirst/Second, discardFirst/Second, flatten for Tuple2 pattern converting :kissing_heart:

import org.parserkt.*
import org.parserkt.util.*
import org.parserkt.pat.*
import org.parserkt.pat.complex.*

val dict = TriePattern<Char, String>().apply {
  mergeStrings("hello" to "ใ“ใ‚“ใซใกใฏ")
  mergeStrings("world" to "ไธ–็•Œ")
}
val noun = Repeat(asList(), dict)
noun.read("helloworld") //[ใ“ใ‚“ใซใกใฏ, ไธ–็•Œ]

val pharse = JoinBy(Decide(elementIn('0'..'9'), elementIn(' ', '\t', '\n', '\r')), dict)
pharse.read("hello6world hello") //Tuple2(first=[ใ“ใ‚“ใซใกใฏ, ไธ–็•Œ, ใ“ใ‚“ใซใกใฏ], second=[Tuple2(first=0, second=6), Tuple2(first=1, second= )])

It's OK (and also easy) to read recursive structure, just defined lateinit var and use Deferred{name} to reference:

// Back converts (third argument for Convert) are optional
sealed class Sexp { data class Term(val name: String): Sexp(); data class Nest(val list: List<Sexp>): Sexp() }
lateinit var sexp: Pattern<Char, Sexp>
val str = Repeat(asString(), !elementIn(' ', '(', ')'))
val atom = Convert(str, { Sexp.Term(it) }, { it.name })

val nestItems = SurroundBy(parens.toCharPat(), JoinBy(item(' '), Deferred{sexp}).mergeConstantJoin())
val nest = Convert(nestItems, { Sexp.Nest(it) }, { it.list })
sexp = Decide(nest, atom).mergeFirst { if (it is Sexp.Nest) 0 else 1 }

sexp.read("((monkey banana) (deep parser))") //Nest(list=[Nest(list=[Term(name=monkey), Term(name=banana)]), Nest(list=[Term(name=deep), Term(name=parser)])])
sexp.rebuild("((monkey banana) (deep parser))")  //((monkey banana) (deep parser))

NOTE: when using pattern Until, think if it can be expressed by Repeat(..., !SatisfPattern) first

References

References on Historic

(script for GitHub pages)

Kotlin Resources

are all listed below.

Resources

listed to get explored on!!

Made with โค๏ธ

to provide different kinds of informations and resources.