### API ###

Normalization
=============

__class__ NormalizeText

* Attributes
    * stopwords - a list of English stopwords
    * spell_checker - a spell checker of type spellchecker_ml (see https://github.com/EricSchles/spellchecker_ml for more information)

__method__ add_to_stopwords(words: list)

Add stop words to the stopwords attribute.

usage::

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.add_to_stopwords(["Hello", "there"])
normalizer.remove_stopwords("Hello there") == ''
```

__method__ remove_stopwords(text: str) -> str

Remove stop words (those found in the stopwords attribute) from the text.

usage::

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.add_to_stopwords(["Hello", "there"])
normalizer.remove_stopwords("Hello there") == ''
```

__method__ normalize_case(text: str) -> str

Convert text to lower case.

usage::

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.normalize_case("Hello") # returns "hello"
```

__method__ initialize_spellchecker(corpus: str, corpus_name: str, words: list, ner_text: str)

Initialize the spell checker with text to ignore and a corpus to train on, for more targeted text correction.

usage::

```
from text_processing_ml.normalization import NormalizeText

# Read the training corpus from disk
with open("corpus.txt", "r") as f:
    corpus = f.read()

normalizer = NormalizeText()
normalizer.initialize_spellchecker(corpus=corpus, corpus_name="corpus", words=["Hello", "there"], ner_text=corpus)
normalizer.make_spelling_correction("Hello there friendz") # returns "Hello there friends"
```

__method__ strip_punctuation(word: str, return_punctuation: bool) -> str

Strips the punctuation from a word. If return_punctuation is True, the punctuation is returned as well, as a separate value.

usage::

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.strip_punctuation("Hello,") # returns "Hello"
normalizer.strip_punctuation("Hello,", return_punctuation=True) # returns ("Hello", ",")
```

__method__ correctly_spelled(word: str) -> bool

Checks whether a word is correctly spelled. Returns True if the word appears in spellchecker_ml's dictionary of words, and False otherwise.

usage::

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.correctly_spelled("Helo") # returns False
```

__method__ make_spelling_correction(text: str) -> str

Correct the spelling of a piece of text using a hidden Markov model (for now).

usage::

```
from text_processing_ml.normalization import NormalizeText

# Read the training corpus from disk
with open("corpus.txt", "r") as f:
    corpus = f.read()

normalizer = NormalizeText()
normalizer.initialize_spellchecker(corpus=corpus, corpus_name="corpus", words=["Hello", "there"], ner_text=corpus)
normalizer.make_spelling_correction("Hello there friendz") # returns "Hello there friends"
```

__method__ correct_whitespace(text: str) -> str

Normalizes the whitespace so that tokens are separated by exactly one space.

usage::

```
from text_processing_ml.normalization import NormalizeText

normalizer = NormalizeText()
normalizer.correct_whitespace(" Hello there friends \t whatever") # returns "Hello there friends whatever"
```

Matching
========

Parsing
=======

__class__ ParseText

* Attributes
    * stemmer - a Porter stemmer from NLTK
    * normalizer - the normalizer found elsewhere in the project (see NormalizeText above)

__method__ stem_tokens(tokens: list) -> list

Returns a list of stemmed tokens. Stemming reduces a word to its root form.

Example:

* runs -> run
* jumping -> jump
* flying -> fly

usage::

```
from text_processing_ml.parsing import ParseText

parser = ParseText()
parser.stem_tokens("Hello there friends".split()) # returns ["Hello", "there", "friend"]
```

__method__ tokenize(text: str) -> list

Tokenize a string of words and stem each token.

usage::

```
from text_processing_ml.parsing import ParseText

parser = ParseText()
parser.tokenize("Hello there friends") # returns ["Hello", "there", "friend"]
```

__method__ normalize_text(text: str) -> str

Normalize a piece of text by lowercasing it and removing punctuation.

usage::

```
from text_processing_ml.parsing import ParseText

parser = ParseText()
parser.normalize_text("Hello there, friend") # returns "hello there friend"
```

__method__ tfidf(documents: list)

Returns the term frequencies for a list of documents, weighted by inverse document frequency (TF-IDF).

usage::

```
from text_processing_ml.parsing import ParseText

parser = ParseText()
parser.tfidf(["Hello there friends", "How are you doing?", "what's up"]) # returns the TF-IDF matrix
```
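
For readers unfamiliar with the weighting that `tfidf` refers to, the following is a minimal pure-Python sketch of one common TF-IDF formulation. It is not ParseText's implementation and may differ from it: the function name `tfidf_sketch`, the whitespace tokenization, and the unsmoothed `log(N / df)` weighting are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_sketch(documents):
    """Return (vocabulary, matrix) with one row of TF-IDF weights per document.

    Hypothetical helper for illustration only; not part of text_processing_ml.
    """
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    n_docs = len(documents)

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocabulary:
            # Term frequency: share of this document's tokens equal to `term`.
            tf = counts[term] / len(tokens) if tokens else 0.0
            # Inverse document frequency: rarer terms get larger weights.
            idf = math.log(n_docs / doc_freq[term])
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix


vocab, weights = tfidf_sketch(["hello there friends", "hello there", "goodbye friends"])
```

Each row of `weights` lines up with `vocab`, so `weights[i][j]` is the weight of `vocab[j]` in document `i`. Under this unsmoothed formulation a term that occurs in every document gets a weight of 0, which is why some implementations add smoothing to the inverse document frequency.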