Input Conversion
A module of helper functions for dealing with various inputs.
-
yalign.input_conversion.generate_documents(filepath, m=20, n=20)
Document generator. Documents are created from the parallel corpus at filepath and are between m and n lines long.
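One way to read "between m and n lines long" is sketched below. This is a library-independent illustration, not yalign's implementation: the helper name `chunk_parallel_documents`, the `seed` parameter, and the choice to draw each length uniformly from [m, n] are all assumptions made for the example.

```python
import random


def chunk_parallel_documents(doc_a, doc_b, m=20, n=20, seed=None):
    """Yield (chunk_a, chunk_b) pairs of between m and n aligned lines.

    Hypothetical helper for illustration; not part of yalign.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("both documents must have the same number of lines")
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= m <= n")
    rng = random.Random(seed)
    start = 0
    while start < len(doc_a):
        # Assumption: each document length is drawn uniformly from [m, n].
        size = rng.randint(m, n)
        # The final chunk may be shorter than m if the corpus runs out.
        yield doc_a[start:start + size], doc_b[start:start + size]
        start += size


lines_en = ["sentence %d" % i for i in range(10)]
lines_es = ["oración %d" % i for i in range(10)]
chunks = list(chunk_parallel_documents(lines_en, lines_es, m=2, n=3, seed=0))
```

With the defaults m=20 and n=20, every chunk except possibly the last would be exactly 20 lines long.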
-
yalign.input_conversion.html_to_document(html, language='en')
Returns the text of an HTML string as a list of Sentences.
-
yalign.input_conversion.parallel_corpus_to_documents(filepath)
Transforms a file in parallel corpus format into two documents. A parallel corpus file has:
- One sentence per line.
- One line for each language, alternating.
- Sentences are tokenized and tokens are space separated.
- The file encoding is UTF-8.
For example:
    This is a sentence .
    Esto es una oración .
    And this , my friend , is another .
    Y esta , mi amigo , es otra .
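The layout described above can be illustrated with a minimal, library-independent sketch. It assumes the two languages alternate line by line, as in the example. The helper name `split_parallel_corpus` is hypothetical and not part of yalign, and it returns plain token lists rather than yalign Sentence objects.

```python
def split_parallel_corpus(text):
    """Split alternating-language corpus text into two token-list documents.

    Hypothetical helper for illustration; not yalign's implementation.
    """
    # Drop blank lines so a trailing newline does not shift the alternation.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    # Even-indexed lines belong to the first language, odd-indexed to the second.
    document_a = [line.split() for line in lines[0::2]]
    document_b = [line.split() for line in lines[1::2]]
    return document_a, document_b


corpus = (
    "This is a sentence .\n"
    "Esto es una oración .\n"
    "And this , my friend , is another .\n"
    "Y esta , mi amigo , es otra .\n"
)
doc_en, doc_es = split_parallel_corpus(corpus)
```

Because the tokens are already space separated, a plain `split()` is enough to recover them; no further tokenization is needed in this sketch.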
-
yalign.input_conversion.parse_training_file(training_file)
Reads SentencePairs from a training file.
-
yalign.input_conversion.srt_to_document(text, lang='en')
Converts a string in SRT (SubRip subtitle) format into a list of Sentences.
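For readers unfamiliar with SRT, the sketch below shows the kind of cleanup such a conversion involves: each cue has a numeric counter, a timestamp line, and one or more text lines. This is an assumption-laden illustration, not yalign's implementation. The helper name `srt_text_lines` is hypothetical, and it returns plain strings rather than Sentences.

```python
import re

# Matches lines such as "00:00:01,000 --> 00:00:03,500".
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}"
)


def srt_text_lines(srt):
    """Return the subtitle text of an SRT string, without counters or timestamps.

    Hypothetical helper for illustration; not part of yalign.
    """
    texts = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # Skip the numeric cue counter, if present.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        # Skip the timestamp line, if present.
        if lines and TIMESTAMP.match(lines[0]):
            lines = lines[1:]
        # Remove simple formatting tags such as <i>...</i>.
        texts.extend(re.sub(r"<[^>]+>", "", line) for line in lines)
    return texts


sample = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,000
<i>How are you?</i>
"""
subtitle_lines = srt_text_lines(sample)
```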
-
yalign.input_conversion.text_to_document(text, language='en')
Returns the string text as a list of Sentences.
-
yalign.input_conversion.tmx_file_to_documents(filepath, lang_a=None, lang_b=None)
Converts a TMX file into two lists of Sentences: the first for language lang_a and the second for language lang_b.
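To show what "two lists, one per language" means for TMX input, here is a minimal sketch using only the standard library. It is not yalign's implementation: the helper name `tmx_segments` is hypothetical, it takes a string rather than a file path, it requires both language codes explicitly (the behavior of the `None` defaults is not described here), and the sample is a trimmed-down TMX fragment rather than a fully valid file.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_segments(tmx_text, lang_a, lang_b):
    """Return two aligned lists of segment strings for lang_a and lang_b.

    Hypothetical helper for illustration; not part of yalign.
    """
    lang_a, lang_b = lang_a.lower(), lang_b.lower()
    root = ET.fromstring(tmx_text)
    segments_a, segments_b = [], []
    for unit in root.iter("tu"):
        by_lang = {}
        for variant in unit.findall("tuv"):
            # Assumption: the language is in xml:lang, or in a plain "lang"
            # attribute as some files use.
            lang = variant.get(XML_LANG) or variant.get("lang")
            seg = variant.find("seg")
            if lang and seg is not None:
                by_lang[lang.lower()] = "".join(seg.itertext()).strip()
        # Keep only units that have a segment in both requested languages.
        if lang_a in by_lang and lang_b in by_lang:
            segments_a.append(by_lang[lang_a])
            segments_b.append(by_lang[lang_b])
    return segments_a, segments_b


sample_tmx = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>This is a sentence.</seg></tuv>
      <tuv xml:lang="es"><seg>Esto es una oración.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Only in English.</seg></tuv>
    </tu>
  </body>
</tmx>"""
english, spanish = tmx_segments(sample_tmx, "en", "es")
```

Dropping units that lack one of the two languages keeps the lists index-aligned, so item i of the first list always pairs with item i of the second.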
-
yalign.input_conversion.tokenize(text, language='en')
Returns a Sentence with Words (i.e., a list of unicode objects).
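As a rough picture of what tokenization produces, the sketch below splits punctuation away from words and yields the space-separated form used by the parallel corpus format. It is a deliberately naive stand-in, not yalign's tokenizer: the helper name `simple_tokenize` is hypothetical, it ignores the language argument, and it returns a plain list of strings.

```python
import re


def simple_tokenize(text):
    """Split text into word and punctuation tokens.

    Hypothetical helper for illustration; not part of yalign.
    """
    # A run of word characters is one token; any other non-space
    # character becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("And this, my friend, is another.")
corpus_line = " ".join(tokens)
```

A rule this simple mishandles contractions and abbreviations (for example, it splits "don't" into three tokens), which is one reason a language-aware tokenizer is preferable in practice.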