Input Conversion

A module of helper functions for dealing with various inputs.

yalign.input_conversion.generate_documents(filepath, m=20, n=20): Document generator. Documents are created from the parallel corpus at filepath, and each one is between m and n lines long.
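As a rough illustration of what "between m and n lines long" can mean, the sketch below cuts two already-aligned lists of sentences into consecutive chunks whose length is drawn between m and n. The function chunk_documents and its rng parameter are hypothetical names, not part of yalign, and the library's own generator may select its documents differently.

```python
import random


def chunk_documents(doc_a, doc_b, m=20, n=20, rng=random):
    """Yield (chunk_a, chunk_b) pairs of between m and n aligned lines.

    Hypothetical helper, not yalign's implementation. The final chunk
    may be shorter than m if the corpus runs out of lines.
    """
    if not 0 < m <= n:
        raise ValueError("expected 0 < m <= n")
    if len(doc_a) != len(doc_b):
        raise ValueError("both documents must have the same number of lines")
    start = 0
    while start < len(doc_a):
        size = rng.randint(m, n)  # inclusive on both ends
        yield doc_a[start:start + size], doc_b[start:start + size]
        start += size


english = ["one .", "two .", "three .", "four .", "five ."]
spanish = ["uno .", "dos .", "tres .", "cuatro .", "cinco ."]
chunks = list(chunk_documents(english, spanish, m=2, n=2))
```

With m equal to n, as in the defaults, every chunk except possibly the last has exactly that many lines.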

yalign.input_conversion.html_to_document(html, language='en'): Returns the text of an HTML string as a list of Sentences.
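To show the kind of preprocessing an HTML input needs before it can be split into sentences, here is a minimal sketch that pulls the visible text out of an HTML string with the standard library. TextExtractor and html_text are hypothetical names for this illustration. The sketch does not reproduce how yalign extracts text, and it stops short of sentence splitting and tokenization.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_text(html):
    """Return the visible text fragments of an HTML string, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


page = ("<html><head><style>p{}</style></head><body>"
        "<p>Hello <b>world</b>.</p><script>var x;</script></body></html>")
fragments = html_text(page)
```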

yalign.input_conversion.parallel_corpus_to_documents(filepath)

Transforms a file in parallel corpus format into two documents. A parallel corpus file has:

One sentence per line.

One line for each language, alternating between the two.

Sentences are tokenized, and tokens are separated by spaces.

The file encoding is UTF-8.

For example:

    This is a sentence .
    Esto es una oración .
    And this , my friend , is another .
    Y esta , mi amigo , es otra .
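The format described above can be read with a few lines of plain Python. The sketch below assumes the two languages alternate line by line, starting with the first language, and that blank lines can be ignored. The helpers split_parallel_corpus and read_parallel_corpus are hypothetical names, not yalign functions, and they return plain lists of tokens instead of yalign's Sentence objects.

```python
def split_parallel_corpus(lines):
    """Split alternating-language lines into two lists of token lists."""
    # Drop blank lines, then split each tokenized sentence on whitespace.
    sentences = [line.split() for line in lines if line.strip()]
    if len(sentences) % 2 != 0:
        raise ValueError("expected an even number of sentence lines")
    # Even positions hold the first language, odd positions the second.
    return sentences[0::2], sentences[1::2]


def read_parallel_corpus(filepath):
    """Read a UTF-8 parallel corpus file and split it by language."""
    with open(filepath, encoding="utf-8") as handle:
        return split_parallel_corpus(handle)


corpus = (
    "This is a sentence .\n"
    "Esto es una oración .\n"
    "And this , my friend , is another .\n"
    "Y esta , mi amigo , es otra .\n"
)
doc_a, doc_b = split_parallel_corpus(corpus.splitlines())
```

Raising an error on an odd number of lines is a design choice for this sketch: a missing line would otherwise silently shift every later sentence into the wrong language.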

yalign.input_conversion.parse_training_file(training_file): Reads SentencePairs from a training file.

yalign.input_conversion.srt_to_document(text, lang='en'): Converts a string of SRT subtitle text into a list of Sentences.
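For readers unfamiliar with SRT, each subtitle block consists of a counter line, a timing line, and one or more lines of text, with blank lines between blocks. The sketch below strips the counters and timings and keeps the text lines. The function srt_text_lines is a hypothetical helper for illustration, and it makes no claim about how yalign parses SRT input.

```python
import re

# Timing lines look like "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_text_lines(srt):
    """Return the subtitle text lines of an SRT string, in order."""
    texts = []
    normalized = srt.replace("\r\n", "\n").strip()
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = [line.strip() for line in block.splitlines()]
        for index, line in enumerate(lines):
            if TIMING.match(line):
                # Everything after the timing line is subtitle text.
                texts.extend(text for text in lines[index + 1:] if text)
                break
    return texts


sample = (
    "1\n"
    "00:00:01,000 --> 00:00:02,500\n"
    "Hello there.\n"
    "\n"
    "2\n"
    "00:00:03,000 --> 00:00:04,000\n"
    "How are\n"
    "you?\n"
)
subtitle_lines = srt_text_lines(sample)
```

Working block by block, and not line by line, keeps a subtitle whose text is only a number from being mistaken for a counter.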

yalign.input_conversion.text_to_document(text, language='en'): Returns the string text as a list of Sentences.

yalign.input_conversion.tmx_file_to_documents(filepath, lang_a=None, lang_b=None): Converts a TMX file into two lists of Sentences: the first for language lang_a and the second for language lang_b.
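A TMX file stores translation units (tu elements), each holding one variant (tuv) per language with the text in a seg element. The sketch below shows one way to pull two parallel lists of segments out of a TMX string with the standard library. The function tmx_segments is a hypothetical name, and the sketch requires both languages to be given explicitly, since the behavior of the None defaults is not described here. It returns plain strings, not yalign Sentences.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_segments(tmx_text, lang_a, lang_b):
    """Return two parallel lists of segment strings for lang_a and lang_b."""
    root = ET.fromstring(tmx_text)
    doc_a, doc_b = [], []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute.
            lang = variant.get(XML_LANG) or variant.get("lang")
            seg = variant.find("seg")
            if lang and seg is not None:
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        # Keep only units that have both requested languages.
        if lang_a.lower() in segments and lang_b.lower() in segments:
            doc_a.append(segments[lang_a.lower()])
            doc_b.append(segments[lang_b.lower()])
    return doc_a, doc_b


SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello world.</seg></tuv>
<tuv xml:lang="es"><seg>Hola mundo.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>No translation here.</seg></tuv></tu>
</body></tmx>"""

english, spanish = tmx_segments(SAMPLE, "en", "es")
```

Units missing either language are skipped so that the two lists stay the same length and remain aligned by position.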

yalign.input_conversion.tokenize(text, language='en'): Returns a Sentence with Words (i.e., a list of unicode objects).