API

Normalization

__class__ NormalizeText

__method__ add_to_stopwords(words: list)

Add the given words to the stopwords attribute.

usage:

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.add_to_stopwords(["Hello", "there"])
normalizer.remove_stopwords("Hello there") # returns ""
```

__method__ remove_stopwords(text: str) -> str

Remove from the text any stop words found in the stopwords attribute.

usage:

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.add_to_stopwords(["Hello", "there"])
normalizer.remove_stopwords("Hello there") # returns ""
```
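
To make the behaviour above concrete, here is a minimal, self-contained sketch of how stop word filtering could work. It is not the library's implementation: the `StopwordFilter` class is hypothetical, and case-insensitive matching on whitespace-separated tokens is an assumption.

```python
class StopwordFilter:
    """Hypothetical stand-in illustrating the stop word handling described above."""

    def __init__(self):
        # Assumption: stop words are stored lower-cased so matching ignores case.
        self.stopwords = set()

    def add_to_stopwords(self, words):
        # Add every word in the list to the stop word set.
        self.stopwords.update(word.lower() for word in words)

    def remove_stopwords(self, text):
        # Keep only the tokens that are not stop words, then rejoin with single spaces.
        kept = [token for token in text.split() if token.lower() not in self.stopwords]
        return " ".join(kept)


stopword_filter = StopwordFilter()
stopword_filter.add_to_stopwords(["Hello", "there"])
print(stopword_filter.remove_stopwords("Hello there friends"))  # prints "friends"
```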

__method__ normalize_case(text: str) -> str

Convert text to lower case.

usage:

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.normalize_case("Hello") # returns "hello"
```

__method__ initialize_spellchecker(corpus: str, corpus_name: str, words: list, ner_text: str)

Initialize the spellchecker with text to ignore and a corpus to train on, for more directed text correction.

usage:

```
from text_processing_ml.normalization import NormalizeText

# Read the training corpus from disk.
with open("corpus.txt", "r") as f:
    corpus = f.read()

normalizer = NormalizeText()
normalizer.initialize_spellchecker(corpus=corpus, corpus_name="corpus", words=["Hello", "there"], ner_text=corpus)
normalizer.make_spelling_correction("Hello there friendz") # returns "Hello there friends"
```

__method__ strip_punctuation(word: str, return_punctuation: bool) -> str

Strip the punctuation from a word. If return_punctuation is True, the punctuation is also returned, separately from the word.

usage:

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.strip_punctuation("Hello,") # returns "Hello"
normalizer.strip_punctuation("Hello,", return_punctuation=True) # returns ("Hello", ",")
```
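
As an illustration of the two return shapes, the following standalone sketch strips ASCII punctuation with the standard library. It is an assumption about the behaviour, not the library's code; in particular, treating `string.punctuation` as the full punctuation set and defaulting `return_punctuation` to False are choices made for this example.

```python
import string


def strip_punctuation(word, return_punctuation=False):
    # Keep every character that is not ASCII punctuation.
    stripped = "".join(ch for ch in word if ch not in string.punctuation)
    if not return_punctuation:
        return stripped
    # Collect the removed punctuation characters in their original order.
    removed = "".join(ch for ch in word if ch in string.punctuation)
    return stripped, removed


print(strip_punctuation("Hello,"))                           # prints Hello
print(strip_punctuation("Hello,", return_punctuation=True))  # prints ('Hello', ',')
```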

__method__ correctly_spelled(word: str) -> bool

Check whether a word is correctly spelled. Returns True if the word appears in spellchecker_ml's dictionary of words, and False otherwise.

usage:

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.correctly_spelled("Helo") # returns False
```
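
The dictionary check described above amounts to a membership test. A minimal sketch, assuming a small hypothetical word set (`KNOWN_WORDS`) in place of spellchecker_ml's real dictionary and assuming lookups are case-insensitive:

```python
# Hypothetical tiny dictionary; the real one comes from spellchecker_ml.
KNOWN_WORDS = {"hello", "there", "friends"}


def correctly_spelled(word, known_words=KNOWN_WORDS):
    # A word counts as correctly spelled only if its lower-cased form is in the dictionary.
    return word.lower() in known_words


print(correctly_spelled("Helo"))   # prints False
print(correctly_spelled("Hello"))  # prints True
```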

__method__ make_spelling_correction(text: str) -> str

Correct the spelling of a piece of text using a hidden Markov model (for now).

usage:

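```
from text_processing_ml.normalization import NormalizeText

# Load the corpus used to initialize the spellchecker.
with open("corpus.txt", "r") as f:
    corpus = f.read()

normalizer = NormalizeText()
normalizer.initialize_spellchecker(corpus=corpus, corpus_name="corpus", words=["Hello", "there"], ner_text=corpus)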
normalizer.make_spelling_correction("Hello there friendz") # returns "Hello there friends"
```
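
For readers who want to see the general shape of word-level correction without the library, here is a deliberately simpler stand-in. It does not use a hidden Markov model; it picks the closest vocabulary entry with `difflib.get_close_matches` from the standard library. The `VOCABULARY` list, the function names, and the 0.8 similarity cutoff are all assumptions for this sketch.

```python
import difflib

# Hypothetical vocabulary; in the library this would be learned from the corpus.
VOCABULARY = ["hello", "there", "friends"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    # Known words are returned unchanged, preserving their original casing.
    if word.lower() in vocabulary:
        return word
    # Otherwise take the single closest vocabulary entry above the similarity cutoff.
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    # Correct each whitespace-separated token independently.
    return " ".join(correct_word(token) for token in text.split())


print(correct_text("Hello there friendz"))  # prints "Hello there friends"
```

Unlike a hidden Markov model, this approach ignores the surrounding words, so it cannot use context to choose between equally close candidates.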

__method__ correct_whitespace(text: str) -> str

Normalize whitespace so that tokens are separated by exactly one space.

usage:

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.correct_whitespace(" Hello  there friends  \t whatever")
# returns "Hello there friends whatever"
```
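
The same normalization can be sketched in one line of plain Python. This is an illustration of the technique, not necessarily how the library does it:

```python
def correct_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace and drops
    # leading and trailing whitespace, so joining with " " leaves single spaces.
    return " ".join(text.split())


print(correct_whitespace(" Hello  there friends  \t whatever"))
# prints "Hello there friends whatever"
```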

Matching

Parsing

__class__ ParseText

  • Attributes
    • stemmer - a Porter Stemmer from nltk
    • normalizer - the normalizer found elsewhere in the project

__method__ stem_tokens(tokens: list) -> list

Returns a list of stemmed tokens. Stemming is the process of reducing a word to its root form.

Example:

  • runs -> run
  • jumping -> jump
  • flying -> fly

usage:

```
from text_processing_ml.parsing import ParseText

parser = ParseText()
parser.stem_tokens("Hello there friends".split()) # returns ["Hello", "there", "friend"]
```
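
To show the idea of stemming without any dependencies, here is a toy suffix stripper. It is not the Porter algorithm that the class uses and will mis-stem many words; the suffix list and the minimum stem length of three characters are assumptions chosen so the examples above work.

```python
# Toy suffix list, checked in this order. The real Porter stemmer has many more rules.
SUFFIXES = ("ing", "s")


def toy_stem(token):
    for suffix in SUFFIXES:
        # Only strip the suffix when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_tokens(tokens):
    # Stem each token independently and return a new list.
    return [toy_stem(token) for token in tokens]


print(stem_tokens(["runs", "jumping", "flying"]))  # prints ['run', 'jump', 'fly']
```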

__method__ tokenize(text: str) -> list

Tokenize and stem a string of words into stemmed tokens.

usage:

```
from text_processing_ml.parsing import ParseText

parser = ParseText()
parser.tokenize("Hello there friends") # returns ["Hello", "there", "friend"]
```

__method__ normalize_text(text: str) -> str

Normalize a piece of text by lower-casing it and removing punctuation.

usage:

```
from text_processing_ml.parsing import ParseText

parser = ParseText()
parser.normalize_text("Hello there, friend") # returns "hello there friend"
```
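
The two steps named above, lower-casing and punctuation removal, can be sketched with the standard library alone. This is an illustration under the assumption that "punctuation" means the ASCII characters in `string.punctuation`; the library may define it differently.

```python
import string


def normalize_text(text):
    # Build a translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    # Lower-case first, then drop the punctuation.
    return text.lower().translate(table)


print(normalize_text("Hello there, friend"))  # prints "hello there friend"
```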

__method__ tfidf(documents: list)

Returns the term frequencies of the documents, weighted by inverse document frequency (TF-IDF).

usage:

```
from text_processing_ml.parsing import ParseText

parser = ParseText()
parser.tfidf(["Hello there friends", "How are you doing?", "what's up"])
# returns the term frequency matrix
```