So, while working on my app to generate worksheets (update: I’m using it! Like, in practice!) I’ve wanted a way to identify the grammar in a sentence. A quick Google search on doing this in NLTK came up mostly with diagramming sentences and looked more complicated than I am ready to tackle. (Though I will learn!)
So, with a bit of poking around, I found a method that I liked. It works with NLTK’s pos_tag function. And that, in turn, works like this:
>>> import nltk
>>> t = "This is a test."
>>> tokens = nltk.word_tokenize(t)
>>> nltk.pos_tag(tokens)
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN'), ('.', '.')]
NLTK actually includes a help interface (nltk.help.upenn_tagset) that explains tags like ‘DT’ and ‘VBZ’. It didn’t take me long to put together a few ‘patterns’ that basically represented the grammars I was looking for:
self.grammar_lookup = {
    'SIMPLE_PAST': ['VBD', "VBD-did/RB-n't/VB", "VBD-did/VB", "VBD-did/RB-not/VB"],
    'SIMPLE_PRESENT': ['VBZ', 'VBP', "VBP-do/RB-n't/VB", "VBP-do/RB-not/VB",
                       "VBZ-does/RB-n't/VB", "VBZ-does/RB-not/VB"],
    'SIMPLE_FUTURE': ['MD-will/VB', "MD-wo/RB/VB", "MD-will/RB/VB"],
    'COMPARISON_AS': ["RB-not/RB-as/JJ/IN-as", "RB-n't/RB-as/JJ/IN-as", 'RB-as/JJ/IN-as',
                      "RB-not/RB-as/RB/IN-as", "RB-n't/RB-as/RB/IN-as", 'RB-as/RB/IN-as'],
    'COMPARISON_THAN': ["JJR/IN-than", "RBR/IN-than"]
}
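As an explanatory note, each pattern is a sequence of items separated by slashes. An item with a hyphen in it pairs a part of speech tag with a specific word, so ‘VBD-did’ only matches the word ‘did’ tagged as VBD, while a bare ‘VB’ matches any word with that tag. (‘MD-wo’ looks odd, but the tokenizer splits “won’t” into “wo” and “n’t”.) I pinned down specific words because I want to reduce the chance of false positives.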
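To make the hyphen convention concrete, here is a minimal sketch of how one of those pattern strings could be unpacked and checked against tagged tokens. This is not the code from my app; parse_pattern and matches_at are hypothetical helper names, and I'm assuming the tag never contains a hyphen itself (true for every pattern above). The tagged sentence is written by hand in the same (word, tag) shape that pos_tag returns, so the sketch runs without NLTK.

```python
def parse_pattern(pattern):
    """Split a pattern such as "VBD-did/RB-n't/VB" into (tag, word) steps."""
    steps = []
    for item in pattern.split("/"):
        # Split on the first hyphen only; an item without a hyphen
        # has no required word, so its word becomes None.
        tag, _, word = item.partition("-")
        steps.append((tag, word or None))
    return steps


def matches_at(tagged, start, steps):
    """Return True if the steps match the tagged tokens beginning at index start."""
    window = tagged[start:start + len(steps)]
    if len(window) < len(steps):
        return False  # not enough tokens left in the sentence
    for (token, token_tag), (tag, word) in zip(window, steps):
        if token_tag != tag:
            return False
        if word is not None and token.lower() != word:
            return False
    return True


# Hand-written tags (an assumption, not real tagger output).
tagged = [("She", "PRP"), ("did", "VBD"), ("n't", "RB"), ("go", "VB"), (".", ".")]
steps = parse_pattern("VBD-did/RB-n't/VB")
hits = [i for i in range(len(tagged)) if matches_at(tagged, i, steps)]
print(steps)  # [('VBD', 'did'), ('RB', "n't"), ('VB', None)]
print(hits)   # [1]
```

Comparing the word in lowercase is a choice I made for the sketch, so that a sentence starting with "Did" would still match a pattern written with "did".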
From there, it was a fairly straightforward process to start iterating over sentences and looking for these patterns. I’m not especially proud of the code that does this and I think that any serious coder could do it more elegantly, so I’m not going to post that. Still, I’m proud of myself for finding a solution to my problem that works…
…for now. (As you can see, I’ve only tested it on five grammars. I’ll add more as I need them.)
How did you implement the grammar checks?
My memory of it (I’ve stopped using it) is that I tokenized and tagged the sentence and then built a string from the list of tagged tokens (comparable to the strings in the lists in the dictionary). Then, I compared that string to the grammar I was looking for.
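Since the original code was never posted, here is a rough sketch of one way that description could work, not a reconstruction of what I actually wrote. It slides a window over the tagged tokens, writes each window out as a string in the same shape as the pattern, and compares the two strings. GRAMMAR_LOOKUP is a cut-down copy of the dictionary from the post, render_window and find_grammars are hypothetical names, and the tags are typed in by hand so the example runs without NLTK.

```python
# A subset of the patterns from the post, kept short for the example.
GRAMMAR_LOOKUP = {
    "SIMPLE_FUTURE": ["MD-will/VB", "MD-wo/RB/VB", "MD-will/RB/VB"],
    "COMPARISON_THAN": ["JJR/IN-than", "RBR/IN-than"],
}


def render_window(window, pattern):
    """Write tagged tokens as a string shaped like the given pattern."""
    parts = []
    for (token, tag), item in zip(window, pattern.split("/")):
        # Include the word only where the pattern itself names a word.
        parts.append(f"{tag}-{token.lower()}" if "-" in item else tag)
    return "/".join(parts)


def find_grammars(tagged):
    """Return the sorted names of grammars whose patterns occur in the sentence."""
    found = set()
    for name, patterns in GRAMMAR_LOOKUP.items():
        for pattern in patterns:
            size = pattern.count("/") + 1  # number of items in the pattern
            for start in range(len(tagged) - size + 1):
                window = tagged[start:start + size]
                if render_window(window, pattern) == pattern:
                    found.add(name)
    return sorted(found)


# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns.
tagged = [("It", "PRP"), ("will", "MD"), ("be", "VB"), ("bigger", "JJR"),
          ("than", "IN"), ("mine", "NN"), (".", ".")]
print(find_grammars(tagged))  # ['COMPARISON_THAN', 'SIMPLE_FUTURE']
```

With real input, the hand-written list would be replaced by the output of nltk.pos_tag(nltk.word_tokenize(sentence)), and the results are only as good as the tagger's guesses.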
Looking back, I’m less proud of it now than I was then. But I remember it working for things written to match the patterns… I just got away from dynamically creating grammar worksheets for the time being.