Porting word2phrase #1

@Prog19

A quick solution to this issue from the Java implementation is to download this code file (from the original C tool), compile it, and execute it from Clojure. The tool marks multi-word phrases in the training text corpus by joining their words with an underscore. (Refer to 'From words to phrases and beyond' from here.)
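
To make the underscore marking concrete, here is a minimal pure-Clojure sketch of the idea. It is not the C tool's code: the scoring rule and the defaults (`min-count` 5, `threshold` 100) are written from my reading of the original tool and should be treated as assumptions, and `mark-phrases` is a hypothetical name. It makes a single pass over whitespace-separated tokens and skips the vocabulary pruning a real port would need.

```clojure
(require '[clojure.string :as string])

(defn mark-phrases
  "Joins adjacent words with an underscore when their bigram score exceeds
  `threshold`. Simplified: one pass, whitespace tokens, no vocabulary pruning."
  [text {:keys [min-count threshold] :or {min-count 5 threshold 100.0}}]
  (let [words   (string/split (string/trim text) #"\s+")
        total   (count words)
        unigram (frequencies words)
        bigram  (frequencies (map vec (partition 2 1 words)))
        ;; Assumed scoring rule:
        ;; (count(ab) - min-count) / (count(a) * count(b)) * total
        score   (fn [a b]
                  (let [pa (unigram a 0), pb (unigram b 0), pab (bigram [a b] 0)]
                    (if (or (< pa min-count) (< pb min-count))
                      0.0
                      (* (/ (- pab min-count) (double pa) pb) total))))
        step    (fn [{:keys [out prev]} w]
                  (if (and prev (> (score prev w) threshold))
                    ;; Glue w onto the last token; clear prev so phrases do not chain.
                    {:out (conj (pop out) (str (peek out) "_" w)) :prev nil}
                    {:out (conj out w) :prev w}))]
    (string/join " " (:out (reduce step {:out [] :prev nil} words)))))

;; Toy corpus, so the parameters are lowered far below the assumed defaults.
(mark-phrases "new york is big but new york was old"
              {:min-count 1 :threshold 2.0})
;; => "new_york is big but new_york was old"
```

On a real corpus the counts are much larger, so the parameters would stay near their defaults; the tiny values here only exist to make the toy sentence produce a phrase.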

The code below runs the executable from /resources in the project directory, either through a Java Runtime instance or by shelling out from Clojure. The input is read from /resources/train.txt, the output is written to /resources/output/out.txt, and all other word2phrase training parameters keep their default values.

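(require '[clojure.java.shell :refer [sh]]
         '[clojure.string :as string])
(import '(java.io BufferedReader InputStreamReader))

(let [;; Project directory, with backslashes normalized to forward slashes
      ;; for Unix. Windows accepts both path styles.
      tmp (-> (System/getProperty "user.dir")
              (.replace "\\" "/"))
      res (str tmp "/resources/")]
  ;; Option 1 (disabled): start the process through java.lang.Runtime.
  ;; Passing the arguments as an array avoids relying on whitespace splitting.
  (comment
    (let [cmd  (into-array String [(str res "word2phrase.exe")
                                   "-train" (str res "train.txt")
                                   "-output" (str res "output/out.txt")])
          proc (.exec (Runtime/getRuntime) cmd)]
      ;; with-open closes the reader even if reading throws.
      (with-open [br (BufferedReader. (InputStreamReader. (.getInputStream proc)))]
        (println (string/join "\n" (line-seq br))))))

  ;; Option 2: shell out with clojure.java.shell/sh and print the tool's stdout.
  (println (:out (sh (str res "word2phrase.exe")
                     "-train" (str res "train.txt")
                     "-output" (str res "output/out.txt"))))
  ;; sh reads the process output in futures; exit explicitly so their
  ;; threads do not keep the JVM alive after the script finishes.
  (System/exit 0))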
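
The `sh` call above prints only `:out`, so a failed run (a bad path, a crash in the tool) goes unnoticed. The sketch below is one possible way to surface such failures. The helper names `word2phrase-command` and `run-word2phrase!` are hypothetical, and creating the output directory up front is an assumption that the tool does not create it itself.

```clojure
(require '[clojure.java.shell :refer [sh]]
         '[clojure.java.io :as io])

(defn word2phrase-command
  "Builds the same argument vector as the sh call above for a resources
  directory given without a trailing slash."
  [res-dir]
  [(str res-dir "/word2phrase.exe")
   "-train"  (str res-dir "/train.txt")
   "-output" (str res-dir "/output/out.txt")])

(defn run-word2phrase!
  "Runs the tool and returns its stdout, or throws with the exit code and
  stderr attached when the process reports a failure."
  [res-dir]
  ;; Assumption: the output directory must already exist before the tool runs.
  (.mkdirs (io/file res-dir "output"))
  (let [{:keys [exit out err]} (apply sh (word2phrase-command res-dir))]
    (if (zero? exit)
      out
      (throw (ex-info "word2phrase failed" {:exit exit :err err})))))
```

Calling `(run-word2phrase! (str (System/getProperty "user.dir") "/resources"))` would then either return the tool's output or raise an exception carrying `:exit` and `:err` for inspection.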
