Yalign Model¶
-
class
yalign.yalignmodel.
MetadataHelper
(metadata)¶ Bases:
dict
-
class
yalign.yalignmodel.
YalignModel
(document_pair_aligner=None, threshold=None, metadata=None)¶ Bases:
object
Main Yalign class. It provides methods to train an alignment model, to load a model from a folder, and to align two documents.
-
align
(document_a, document_b)¶ Try to detect aligned sentences in the comparable documents document_a and document_b. The returned alignments are expected to meet the F-measure for which the model was trained.
-
align_indexes
(document_a, document_b)¶ Same as align, but returns sentence indexes into the documents instead of the sentences themselves.
-
classmethod
load
(model_directory)¶ Loads an existing YalignModel from the folder that contains it.
-
optimize_gap_penalty_and_threshold
(document_a, document_b, real_alignments)¶ Given documents document_a and document_b (not necessarily aligned) and the real_alignments for those documents, train the YalignModel instance to maximize the target F-measure (the quality measure).
real_alignments is a list of index pairs (i, j) into document_a and document_b respectively, each indicating that those two sentences are aligned. Pairs not included in real_alignments are assumed to be wrong alignments.
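To make the shape of real_alignments concrete, here is a minimal sketch of how an F-measure could be computed from lists of (i, j) index pairs. The helper name f_measure and the beta parameter are hypothetical and not part of yalign; the library's own scoring may weight precision and recall differently.

```python
def f_measure(real_alignments, predicted_alignments, beta=1.0):
    """Hypothetical helper: F-measure of predicted (i, j) pairs.

    Any predicted pair that is absent from real_alignments counts as
    a wrong alignment, matching the convention described above.
    """
    real = set(real_alignments)
    predicted = set(predicted_alignments)
    hits = len(real & predicted)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(real)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Sentence 1 of document_a is really aligned with sentence 2 of document_b,
# but the prediction pairs it with sentence 1 instead.
real_alignments = [(0, 0), (1, 2), (3, 3)]
predicted_alignments = [(0, 0), (1, 1), (3, 3)]
score = f_measure(real_alignments, predicted_alignments)
```

In this example two of the three predicted pairs are correct and two of the three real pairs are recovered, so precision and recall are both 2/3.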
-
save
(model_directory)¶ Store a serialization of a YalignModel instance in a given folder. Metadata is stored in a separate file.
-
sentence_pair_score
¶
-
word_pair_score
¶
-
-
yalign.yalignmodel.
apply_threshold
(alignments, threshold)¶
-
yalign.yalignmodel.
basic_model
(corpus_filepath, word_scores_filepath, lang_a=None, lang_b=None)¶ Creates and trains a YalignModel with the basic configuration and default values.
corpus_filepath is the path to a parallel corpus used for training. It can be:
- a csv file with two sentences and alignment information, or
- a tmx file with correct alignments (a regular parallel corpus), or
- a text file with interleaved sentences (one line in language A, the next in language B)
word_scores_filepath is the path to a csv file (possibly gzipped) with word dictionary data (for example, “house,casa,0.91”).
lang_a and lang_b are required by the tokenizer in the case of a tmx file. In the other cases they are not necessary, because the words are assumed to be already tokenized.
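The two simplest input formats described above can be illustrated with a short sketch. The functions read_word_scores and read_interleaved are hypothetical helpers written for this example, not yalign functions, and detecting gzip by the ".gz" extension is an assumption; basic_model does its own parsing internally.

```python
import csv
import gzip
import os
import tempfile


def read_word_scores(path):
    """Hypothetical reader for rows such as: house,casa,0.91"""
    # Assumption: a gzipped dictionary is recognised by its ".gz" suffix.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            word_a, word_b, score = row
            yield word_a, word_b, float(score)


def read_interleaved(path):
    """Hypothetical reader: one line in language A, the next in language B."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
    return list(zip(lines[0::2], lines[1::2]))


# Build tiny sample files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    scores_path = os.path.join(tmp, "dictionary.csv.gz")
    with gzip.open(scores_path, "wt", encoding="utf-8") as handle:
        handle.write("house,casa,0.91\n")
    word_scores = list(read_word_scores(scores_path))

    corpus_path = os.path.join(tmp, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as handle:
        handle.write("the house is big\nla casa es grande\nhello\nhola\n")
    sentence_pairs = read_interleaved(corpus_path)
```

The sample sentences are already tokenized on whitespace, which is the case in which lang_a and lang_b can be omitted.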
-
yalign.yalignmodel.
best_threshold
(real_alignments, predicted_alignments)¶ Returns the best F score and the corresponding threshold value for the current gap_penalty.
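The idea of searching for a threshold that maximizes the F score can be sketched as follows. This is not yalign's implementation: the function best_threshold_sketch, the (i, j, cost) tuple layout, and the convention that a lower cost means a better alignment are all assumptions made for illustration.

```python
def f1(real, predicted):
    """Balanced F score between two collections of (i, j) pairs."""
    real, predicted = set(real), set(predicted)
    hits = len(real & predicted)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(real)
    return 2 * precision * recall / (precision + recall)


def best_threshold_sketch(real_alignments, scored_alignments):
    """Try every observed cost as a threshold and keep the best one.

    Assumption: scored_alignments holds (i, j, cost) tuples and a pair
    is kept when its cost is at or below the threshold.
    """
    best_f, best_t = 0.0, None
    for threshold in sorted({cost for _, _, cost in scored_alignments}):
        kept = [(i, j) for i, j, cost in scored_alignments if cost <= threshold]
        score = f1(real_alignments, kept)
        if score > best_f:
            best_f, best_t = score, threshold
    return best_f, best_t


real = [(0, 0), (1, 1), (2, 2)]
scored = [(0, 0, 0.1), (1, 1, 0.2), (2, 3, 0.6), (0, 2, 0.7)]
best_f, best_t = best_threshold_sketch(real, scored)
```

With this sample data, raising the threshold past 0.2 only admits wrong pairs, so the search settles on 0.2.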
-
yalign.yalignmodel.
pre_filter_alignments
(alignments)¶
-
yalign.yalignmodel.
random_sampling_maximizer
(F, min_, max_, n=None)¶
-
yalign.yalignmodel.
score_with_best_threshold
(aligner, xs, ys, gap_penalty, real_alignments)¶