Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.
Dick Grune Ceriel J.H. Jacobs Parsing Techniques A Practical Guide Second Edition Dick Grune and Ceriel J.H. Jacobs Faculteit Exacte Wetenschappen Vrije Universiteit De Boelelaan 1081 1081 HV Amsterdam The Netherlands ISBN-13: 978-0-387-20248-8 e-ISBN-13: 978-0-387-68954-8 Library of Congress Control Number: 2007936901 ©2008 Springer Science+Business Media, LLC ©1990 Ellis Horwood Ltd. Preface to the Second Edition As is fit, this second edition arose out of our readers’ demands to read about new developments and our desire to write about them. Although parsing techniques is notafastmovingfield,itdoesmove. Whenthefirsteditionwenttopressin1990, therewasonlyonetentativeandfairlyrestrictivealgorithmforlinear-timesubstring parsing.Nowthereareseveralpowerfulones,coveringalldeterministiclanguages; we describe them in Chapter 12. In 1990 Theorem 8.1 from a 1961 paper by Bar- Hillel, Perles, and Shamir lay gathering dust; in the last decade it has been used to createnewalgorithms,andtoobtaininsightintoexistingones.Wereportonthisin Chapter13. Moreandmorenon-Chomsky systemsareused, especially inlinguistics.None excepttwo-levelgrammarshadanyprominence20yearsago;wenowdescribesix of them in Chapter 15. Non-canonical parsers were considered oddities for a very longtime;nowtheyareamongthemostpowerfullinear-timeparserswehave;see Chapter10. Althoughstillnotverypractical,marvelousalgorithmsforparallelparsinghave beendesignedthatshednewlightontheprinciples;seeChapter14.In1990agen- eralizedLLparserwasdeemedimpossible;nowwedescribetwoinChapter11. Traditionally, and unsurprisingly, parsers have been used for parsing; more re- cently they are also being used for code generation, data compression and logic language implementation, as shown in Section 17.5. Enough. The reader can find more developments in many places in the book and in the Annotated Bibliography inChapter18. Kees van Reeuwijk has — only half in jest — called our book “a reservation for endangered parsers”. We agree — partly; it is more than that — and we make noapologies.Several algorithmsinthisbookhaveverylimitedorjustnopractical value. We have included them because we feel they embody interesting ideas and offer food for thought; they might also grow and acquire practical value. But we alsoincludemanyalgorithmsthatdohavepracticalvaluebutaresorelyunderused; describingthemheremightraisetheirstatusintheworld. ExercisesandProblems Thisbookisnotatextbookintheschoolsenseoftheword.Fewuniversitieshave acourseinParsingTechniques,and,asstatedinthePrefacetotheFirstEdition,read- erswillhaveverydifferentmotivationstousethisbook.Wehavethereforeincluded hardlyanyquestionsortasksthatexercisethematerialcontainedwithinthisbook; readerscannodoubtmakeupsuchtasksforthemselves.Thequestionsposedinthe problemsectionsattheendofeachchapterusuallyrequirethereadertostepoutside the bounds of the covered material. The problems have been divided into three not toowell-definedclasses: • notmarked—probablydoableinafewminutestoacoupleofhours. • markedProject—probablyalotofwork,butalmostcertainlydoable. • markedResearchProject—almostcertainlyalotofwork,buthopefullydoable. Wemakenoclaimsastotherelevanceofanyoftheseproblems;wehopethatsome readers will find some of them enlightening, interesting, or perhaps even useful. Ideas, hints, and partial or complete solutions to a number of the problems can be foundinChapterA. Therearealsoafewquestionsonformallanguagethatwerenotansweredeas- ily in the existing literature but have some importance to parsing. These have been markedaccordinglyintheproblemsections. AnnotatedBibliography Forthefirstedition,we,theauthors,readandsummarizedallpapersonparsing thatwecouldlayourhandson.Seventeenyearslater,withtheincreaseinpublica- tionsandeasieraccessthankstotheInternet,thatisnolongerpossible,muchtoour chagrin.Inthefirsteditionweincludedallrelevantsummaries.Againthatisnotpos- siblenow,sincedoingsowouldhavegreatlyexceededthenumberofpagesallotted to this book. The printed version of this second edition includes only those refer- encestotheliteratureandtheirsummariesthatareactuallyreferredtointhisbook. The complete bibliography with summaries as far as available can be found on the websiteofthisbook;itincludesitsownauthorsindexandsubjectindex.Thissetup alsoallowsustolistwithouthesitationtechnicalreportsandothermaterialofpossi- blylowaccessibility.OftenreferencestosectionsfromChapter18refertotheWeb versionofthosesections;attentionisdrawntothisbycallingthem“(Web)Sections”. We do not supply URLs in this book, for two reasons: they are ephemeral and may be incorrect next year, tomorrow, or even before the book is printed; and, es- peciallyforsoftware,betterURLsmaybeavailablebythetimeyoureadthisbook. The best URL is a few well-chosen search terms submitted to a good Web search engine. EveninthelasttenyearswehaveseenanumberofPh.Dtheseswritteninlan- guagesotherthanEnglish,specificallyGerman,French,SpanishandEstonian.This choice of language has the regrettable but predictable consequence that their con- tents have been left out of the main stream of science. This is a loss, both to the authors and to the scientific community. Whether we like it or not, English is the defactostandardlanguageofpresent-dayscience.Thetimethatascientificallyin- terested gentleman of leisure could be expected to read French, German, English, Greek, Latin and a tad of Sanskrit is 150 years in the past; today, students and sci- entists need the room in their heads and the time in their schedules for the vastly increased amount of knowledge. Although we, the authors, can stillread most (but not all) of the above languages and have done our best to represent the contents of thenon-Englishthesesadequately,thiswillnotsufficetogivethemtheinternational attentiontheydeserve. TheFutureofParsing,akaTheCrystalBall Iftherewilleverbeathirdeditionofthisbook,weexpectittobesubstantially thinner (except for the bibliography section!). The reason is that the more parsing algorithmsonestudiesthemoretheyseemsimilar,andthereseemstobegreatop- portunity for unification. Basically almost all parsing is done by top-down search withleft-recursionprotection;thisistrueevenfortraditionalbottom-uptechniques like LR(1), where the top-down search is built into the LR(1) parse tables. In this respect it is significant that Earley’s method is classified as top-down by some and asbottom-upbyothers.Thegeneralmemoizingmechanismoftabularparsingtakes the exponential sting out of the search. And it seems likely that transforming the usual depth-first search into breadth-first search will yield many of the generalized deterministicalgorithms;inthisrespectwepointtoSikkel’sPh.Dthesis[158].To- gether this seems to cover almost all algorithms in this book, including parsing by intersection.Purebottom-upparserswithoutatop-downcomponentarerareandnot verypowerful. Sointhetheoreticalfutureofparsingweseeconsiderablesimplificationthrough unificationofalgorithms;therolethatparsingbyintersectioncanplayinthisisnot clear. The simplification does not seem to extend to formal languages: it is still as difficulttoprovetheintuitivelyobviousfactthatallLL(1)grammarsareLR(1)asit was35yearsago. Thepracticalfutureofparsingmaylieinadvancedpatternrecognition,inaddi- tiontoitstraditionaltasks;thepracticalcontributionsofparsingbyintersectionare againnotclear. Amsterdam,Amstelveen DickGrune June2007 CerielJ.H.Jacobs Preface to the First Edition Parsing(syntacticanalysis)isoneofthebestunderstoodbranchesofcomputersci- ence.Parsersarealreadybeingusedextensivelyinanumberofdisciplines:incom- puter science (for compiler construction, database interfaces, self-describing data- bases, artificial intelligence), in linguistics (for text analysis, corpora analysis, ma- chinetranslation,textualanalysisofbiblicaltexts),indocumentpreparationandcon- version,intypesettingchemicalformulaeandinchromosomerecognition,toname afew;theycanbeused(andperhapsare)inafarlargernumberofdisciplines.Itis thereforesurprisingthatthereisnobookwhichcollectstheknowledgeaboutpars- ingandexplainsittothenon-specialist.Partofthereasonmaybethatparsinghasa nameforbeing“difficult”.IndiscussingtheAmsterdamCompilerKitandinteach- ingcompilerconstruction,ithas,however,beenourexperiencethatseeminglydiffi- cultparsingtechniquescanbeexplainedinsimpleterms,giventherightapproach. Thepresentbookistheresultoftheseconsiderations. This book does not address a strictly uniform audience. On the contrary, while writingthisbook,wehaveconsistentlytriedtoimaginegivingacourseonthesubject to a diffuse mixture of students and faculty members of assorted faculties, sophis- ticatedlaymen,theavidreadersofthesciencesupplementofthelargenewspapers, etc.Suchacoursewasnevergiven;adiverseaudiencelikethatwouldbetoouncoor- dinatedtoconveneatregularintervals,whichiswhywewrotethisbook,toberead, studied,perusedorconsultedwhereverorwheneverdesired. Addressing such a varied audience has its own difficulties (and rewards). Al- thoughnoexplicitmathwasused,itcouldnotbeavoidedthatanamountofmath- ematical thinking should pervade this book. Technical terms pertaining to parsing haveofcoursebeenexplainedinthebook,butsometimesatermonthefringeofthe subject has been used without definition. Any reader who has ever attended a lec- tureonanon-familiarsubjectknowsthephenomenon.Heskipstheterm,assumesit referstosomethingreasonableandhopesitwillnotrecurtoooften.Andthenthere will be passages where the reader will think we are elaborating the obvious (this paragraph may be one such place). The reader may find solace in the fact that he doesnothavetodoodlehistimeawayorstareoutofthewindowuntilthelecturer progresses. Onthepositiveside,andthatisthemainpurposeofthisenterprise,wehopethat by means of a book with this approach we can reach those who were dimly aware oftheexistenceandperhaps oftheusefulnessofparsingbutwhothought itwould foreverbehiddenbehindphraseslike: LetPbeamappingVN −Φ→2(VN∪VT)∗ andHahomomorphism. Noknowledge ofanyparticularprogramminglanguageisrequired.Thebookcon- tainstwoorthreeprogramsinPascal,whichserveasactualizationsonlyandplaya minorroleintheexplanation.Whatisrequired,though,isanunderstandingofalgo- rithmicthinking,especiallyofrecursion.BookslikeLearningtoprogrambyHoward Johnston(Prentice-Hall,1985)orProgrammingfromfirstprinciplesbyRichardBor- nat (Prentice-Hall 1987) provide an adequate background (but supply more detail than required). Pascal was chosen because it is about the only programming lan- guagemoreorlesswidelyavailableoutsidecomputerscienceenvironments. Thebookfeaturesanextensiveannotatedbibliography.Theuserofthebibliogra- phyisexpectedtobemorethancasuallyinterestedinparsingandtopossessalready areasonableknowledgeofit,eitherthroughthisbookorotherwise.Thebibliogra- phyasalistservestoopenupthemoreaccessiblepartoftheliteratureonthesubject to the reader; the annotations are in terse technical prose and we hope they will be usefulassteppingstonestoreadingtheactualarticles. Onthesubjectofapplicationsofparsers,thisbookisvague.Althoughwesug- gestanumberofapplicationsinChapter1,welacktheexpertisetosupplydetails. Itisobviousthatmusicalcompositionspossessastructurewhichcanlargelybede- scribedbyagrammarandthusisamenabletoparsing,butweshallhavetoleaveit tothemusicologiststoimplementtheidea.Itwaslessobvioustousthatbehaviour atcorporatemeetingsproceedsaccordingtoagrammar,butwearetoldthatthisis soandthatitisasubjectofsocio-psychologicalresearch. Acknowledgements Wethankthepeoplewhohelpedusinwritingthisbook.MariondeKriegerhas retrievedinnumerablebooksandcopiesofjournalarticlesforusandwithoutheref- forttheannotatedbibliographywouldbemuchfurtherfromcompleteness.EdKeizer haspatientlyrestoredpeacebetweenusandthepic|tbl|eqn|psfig|troffpipeline,onthe manyoccasionswhenweabused,overloadedorjustplainlymisunderstoodthelatter. LeovanMoergestelhasmadethehardwaredothingsforusthatitwouldnotdofor the uninitiated. We also thank Erik Baalbergen, Frans Kaashoek, Erik Groeneveld, GercoBallintijn,JacoImthorn,andEgonAmadafortheircriticalremarksandcon- tributions.TheroseattheendofChapter2isbyArwenGrune.IlanaandLilyGrune typedpartsofthetextonvariousoccasions. WethanktheFaculteitWiskundeenInformaticaoftheVrijeUniversiteitforthe useoftheequipment. Inawidersense,weextendourthankstothehundredsofauthorswhohavebeen sokindastoinventscoresofcleverandelegantalgorithmsandtechniquesforusto exhibit.Wehopewehavenamedthemallinourbibliography. Amsterdam,Amstelveen DickGrune July1990 CerielJ.H.Jacobs Contents PrefacetotheSecondEdition . v PrefacetotheFirstEdition . xi 1 Introduction. 1 1.1 ParsingasaCraft. 2 1.2 TheApproachUsed . 2 1.3 OutlineoftheContents. 3 1.4 TheAnnotatedBibliography . 4 2 GrammarsasaGeneratingDevice . 5 2.1 LanguagesasInfiniteSets . 5 2.1.1 Language. 5 2.1.2 Grammars . 7 2.1.3 ProblemswithInfiniteSets . 8 2.1.4 DescribingaLanguagethroughaFiniteRecipe . 12 2.2 FormalGrammars . 14 2.2.1 TheFormalismofFormalGrammars. 14 2.2.2 GeneratingSentencesfromaFormalGrammar . 15 2.2.3 TheExpressivePowerofFormalGrammars. 17 2.3 TheChomskyHierarchyofGrammarsandLanguages. 19 2.3.1 Type1Grammars. 19 2.3.2 Type2Grammars. 23 2.3.3 Type3Grammars. 30 2.3.4 Type4Grammars. 33 2.3.5 Conclusion . 34 2.4 ActuallyGeneratingSentencesfromaGrammar. 34 2.4.1 ThePhrase-StructureCase . 34 2.4.2 TheCSCase . 36 2.4.3 TheCFCase . 36 2.5 ToShrinkorNotToShrink . 38 2.6 GrammarsthatProducetheEmptyLanguage . 41 2.7 TheLimitationsofCFandFSGrammars. 42 2.7.1 TheuvwxyTheorem. 42 2.7.2 TheuvwTheorem. 45 2.8 CFandFSGrammarsasTransitionGraphs. 45 2.9 HygieneinContext-FreeGrammars . 47 2.9.1 UndefinedNon-Terminals. 48 2.9.2 UnreachableNon-Terminals . 48 2.9.3 Non-ProductiveRulesandNon-Terminals . 48 2.9.4 Loops. 48 2.9.5 CleaningupaContext-FreeGrammar . 49 2.10 SetPropertiesofContext-FreeandRegularLanguages . 52 2.11 TheSemanticConnection. 54 2.11.1 AttributeGrammars . 54 2.11.2 TransductionGrammars . 55 2.11.3 AugmentedTransitionNetworks . 56 2.12 AMetaphoricalComparisonofGrammarTypes. 56 2.13 Conclusion. 59 3 IntroductiontoParsing . 61 3.1 TheParseTree. 61 3.1.1 TheSizeofaParseTree . 62 3.1.2 VariousKindsofAmbiguity . 63 3.1.3 LinearizationoftheParseTree . 65 3.2 TwoWaystoParseaSentence. 65 3.2.1 Top-DownParsing . 66 3.2.2 Bottom-UpParsing. 67 3.2.3 Applicability. 68 3.3 Non-DeterministicAutomata. 69 3.3.1 ConstructingtheNDA . 70 3.3.2 ConstructingtheControlMechanism. 70 3.4 RecognitionandParsingforType0toType4Grammars . 71 3.4.1 TimeRequirements . 71 3.4.2 Type0andType1Grammars . 72 3.4.3 Type2Grammars. 73 3.4.4 Type3Grammars. 75 3.4.5 Type4Grammars. 75 3.5 AnOverviewofContext-FreeParsingMethods . 76 3.5.1 Directionality . 76 3.5.2 SearchTechniques . 77 3.5.3 GeneralDirectionalMethods . 78 3.5.4 LinearMethods. 80 3.5.5 DeterministicTop-DownandBottom-UpMethods . 82 3.5.6 Non-CanonicalMethods . 83 3.5.7 GeneralizedLinearMethods. 84