org.olat.core.commons.services.text.impl.nutch
Class NGramProfile

java.lang.Object
  extended by org.olat.core.commons.services.text.impl.nutch.NGramProfile

public class NGramProfile
extends java.lang.Object

This class runs an ngram analysis over submitted text; the results can be used for automatic language identification. The similarity calculation is experimental. You have been warned. Methods are provided to build new NGramProfile instances.

Author:
Sami Siren, Jerome Charron - http://frutch.free.fr/

Field Summary
static OLog log
           
 
Constructor Summary
NGramProfile(java.lang.String name, int minlen, int maxlen)
          Construct a new ngram profile
 
Method Summary
 void add(java.lang.StringBuffer word)
          Add ngrams from a single word to this profile
 void analyze(java.lang.StringBuilder text)
          Analyze a piece of text
static NGramProfile create(java.lang.String name, java.io.InputStream is, java.lang.String encoding)
          Create a new language profile from a (preferably quite large) text file
 java.lang.String getName()
           
 float getSimilarity(NGramProfile another)
          Calculate a score for how well two NGramProfiles match each other
 java.util.List<org.olat.core.commons.services.text.impl.nutch.NGramProfile.NGramEntry> getSorted()
          Return a sorted list of ngrams (sorted by 1. frequency, 2. sequence)
 void load(java.io.InputStream is)
          Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
static void main(java.lang.String[] args)
          main method used for testing only
 void save(java.io.OutputStream os)
          Writes NGramProfile content into OutputStream, content is outputted with UTF-8 encoding
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

log

public static final OLog log
Constructor Detail

NGramProfile

public NGramProfile(java.lang.String name,
                    int minlen,
                    int maxlen)
Construct a new ngram profile

Parameters:
name - is the name of the profile
minlen - is the min length of ngram sequences
maxlen - is the max length of ngram sequences
Method Detail

getName

public java.lang.String getName()
Returns:
Returns the name.

add

public void add(java.lang.StringBuffer word)
Add ngrams from a single word to this profile

Parameters:
word - is the word to add
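
As an illustration of what adding the ngrams of a single word involves, the following is a minimal sketch that enumerates every substring of a word whose length lies between minlen and maxlen. The class name NGramExtractionSketch and the method ngrams are hypothetical and not part of this API; the actual add method may treat word boundaries or very short words differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NGramProfile.
class NGramExtractionSketch {

    // Returns every contiguous substring of word whose length is in [minlen, maxlen],
    // shortest ngrams first.
    static List<String> ngrams(CharSequence word, int minlen, int maxlen) {
        List<String> result = new ArrayList<>();
        for (int n = minlen; n <= maxlen; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                result.add(word.subSequence(i, i + n).toString());
            }
        }
        return result;
    }
}
```

For the word "word" with minlen 1 and maxlen 2, this sketch yields w, o, r, d, wo, or, rd.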

analyze

public void analyze(java.lang.StringBuilder text)
Analyze a piece of text

Parameters:
text - the text to be analyzed
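
Analyzing a piece of text amounts to splitting it into words and counting the ngrams of each word. The sketch below shows one way this could be done. It assumes that words are separated by any non-letter character and that case is ignored; neither assumption is confirmed by this documentation, and the names TextAnalysisSketch and countNgrams are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of NGramProfile.
class TextAnalysisSketch {

    // Counts how often each ngram of length minlen..maxlen (minlen >= 1) occurs
    // inside the words of text.
    static Map<String, Integer> countNgrams(CharSequence text, int minlen, int maxlen) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: lower-case the text and split on runs of non-letter characters.
        String[] words = text.toString().toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (String word : words) {
            for (int n = minlen; n <= maxlen; n++) {
                for (int i = 0; i + n <= word.length(); i++) {
                    counts.merge(word.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```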

getSorted

public java.util.List<org.olat.core.commons.services.text.impl.nutch.NGramProfile.NGramEntry> getSorted()
Return a sorted list of ngrams (sort done by 1. frequency 2. sequence)

Returns:
sorted list of ngrams
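
The two-level ordering (frequency first, then sequence) can be sketched with a comparator as follows. This sketch assumes that more frequent ngrams come first and that ties are broken by the natural string order of the sequence; the documentation does not state the sort direction. The class SortedNgramSketch works on a plain map of counts and is hypothetical, since NGramEntry is not documented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of NGramProfile.
class SortedNgramSketch {

    // Returns the ngram sequences ordered by descending count, then alphabetically.
    static List<String> sorted(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Assumption: higher frequency sorts first.
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            result.add(entry.getKey());
        }
        return result;
    }
}
```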

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

getSimilarity

public float getSimilarity(NGramProfile another)
Calculate a score for how well two NGramProfiles match each other

Parameters:
another - ngram profile to compare against
Returns:
similarity score; 0 means an exact match
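
Because 0 denotes an exact match, the returned value behaves like a distance rather than a similarity. The sketch below shows one simple distance with that property: half the sum of absolute differences between the relative frequencies of two profiles. It is an illustration only and is not claimed to be the formula used by getSimilarity; the class ProfileDistanceSketch and its map-based profiles are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of NGramProfile.
class ProfileDistanceSketch {

    // p and q map each ngram to its relative frequency (values summing to 1).
    // Returns 0 for identical profiles and 1 for profiles with no ngram in common.
    static float distance(Map<String, Float> p, Map<String, Float> q) {
        Set<String> ngrams = new HashSet<>(p.keySet());
        ngrams.addAll(q.keySet());
        float sum = 0f;
        for (String ngram : ngrams) {
            // An ngram missing from a profile counts as frequency 0.
            sum += Math.abs(p.getOrDefault(ngram, 0f) - q.getOrDefault(ngram, 0f));
        }
        return sum / 2f;
    }
}
```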

load

public void load(java.io.InputStream is)
          throws java.io.IOException
Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)

Parameters:
is - the InputStream to read
Throws:
java.io.IOException

create

public static NGramProfile create(java.lang.String name,
                                  java.io.InputStream is,
                                  java.lang.String encoding)
Create a new language profile from a (preferably quite large) text file

Parameters:
name - is the name of the profile
is - is the stream to read
encoding - is the encoding of stream

save

public void save(java.io.OutputStream os)
          throws java.io.IOException
Writes NGramProfile content into OutputStream, content is outputted with UTF-8 encoding

Parameters:
os - the Stream to output to
Throws:
java.io.IOException
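
Both load and save work on UTF-8 streams. The following sketch shows a UTF-8 round trip for a table of ngram counts. The line format used here (the ngram, a space, then its count) is an assumption made for illustration; the actual file format of NGramProfile is not documented on this page. All names in ProfileIoSketch are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of NGramProfile.
class ProfileIoSketch {

    // Writes one "ngram count" line per entry, encoded as UTF-8.
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer writer = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            writer.write(entry.getKey() + " " + entry.getValue() + "\n");
        }
        writer.flush(); // flush only; closing the stream is left to the caller
    }

    // Reads the lines written by save, decoding the stream as UTF-8.
    static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            int separator = line.lastIndexOf(' ');
            if (separator <= 0) {
                continue; // skip blank or malformed lines
            }
            counts.put(line.substring(0, separator), Integer.parseInt(line.substring(separator + 1)));
        }
        return counts;
    }

    // Saves to an in-memory buffer and loads the result back.
    static Map<String, Integer> roundTrip(Map<String, Integer> counts) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            save(counts, buffer);
            return load(new ByteArrayInputStream(buffer.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Fixing the charset explicitly on both sides avoids depending on the platform default encoding, which matters for profiles that contain non-ASCII ngrams.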

main

public static void main(java.lang.String[] args)
main method used for testing only

Parameters:
args -