A quick and dirty grammar check in NLTK

So, while working on my app to generate worksheets (update: I’m using it! Like, in practice!) I’ve wanted a way to identify the grammar in a sentence. A quick Google search on doing this in NLTK came up mostly with diagramming sentences and looked more complicated than I am ready to tackle. (Though I will learn!)

So, with a bit of poking around, I found a method that I liked. It works with NLTK’s pos_tag function. And that, in turn, works like this:

>>> import nltk
>>> t = "This is a test."
>>> tokens = nltk.word_tokenize(t)
>>> nltk.pos_tag(tokens)
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN'), ('.', '.')]

NLTK actually includes a help interface to help with the ‘DT’ and ‘VBZ’ tags. It didn’t take me long to put together a few ‘patterns’ that basically represented the grammars I was looking for:

self.grammar_lookup = {
            'SIMPLE_PAST': ['VBD', "VBD-did/RB-n't/VB", "VBD-did/VB", "VBD-did/RB-not/VB"],
            'SIMPLE_PRESENT': ['VBZ', 'VBP', "VBP-do/RB-n't/VB", "VBP-do/RB-not/VB", "VBZ-does/RB-n't/VB",
                               "VBZ-does/RB-not/VB"],
            'SIMPLE_FUTURE': ['MD-will/VB', "MD-wo/RB/VB", "MD-will/RB/VB"],
            'COMPARISON_AS': ["RB-not/RB-as/JJ/IN-as", "RB-n't/RB-as/JJ/IN-as", 'RB-as/JJ/IN-as',
                              "RB-not/RB-as/RB/IN-as", "RB-n't/RB-as/RB/IN-as", 'RB-as/RB/IN-as'],
            'COMPARISON_THAN': ["JJR/IN-than", "RBR/IN-than"]
        }

As an explanatory note, the ones with a hyphen in them represent the part of speech tag and the specific word. I did that because I want to reduce the chance of false positives.

From there, it was a fairly straightforward process to start iterating over sentences and looking for these patterns. I’m not especially proud of the code that does this and I think that any serious coder could do it more elegantly, so I’m not going to post that. Still, I’m proud of myself for finding a solution to my problem that works…

…for now. (As you can see, I’ve only tested it on five grammars. I’ll add more as I need them.)

Advertisements

2 thoughts on “A quick and dirty grammar check in NLTK

    1. My memory of it (I’ve stopped using it) is that I tokenized the sentence and then built a string from the list of tokens (Comparable to the strings in the lists in the dictionary). Then, I compared that string to the grammar I was looking for.

      Looking back, I’m less proud of it now than I was then. But I remember it working for things written to match the patterns… I just got away from dynamically creating grammar worksheets for the time being.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s