Problem: Identify Topics In A Spreadsheet of Texts

To make it more interesting, let’s say you also want to preserve the ID assigned to each row in the original document.

Solution: Use the PreCeive Batch utility

Method:

We will work on our sample spreadsheet in preceive-cli/src/dist/corpus/100-texts.xlsx.
First, pick a couple of topics you’d like to detect. This sheet contains articles about Immigration, Art, Letters and more.

At TheySay we have some generic Topics that we identify such as Defence, Finance, Health, Property, Technology & Terrorism.
See the current list on Apiary
In this case we need to create some more specific rules to perform a more fine-grained topic detection.
There are 2 ways to do this:-

  1. Upload individual Topic expressions to your PreCeive account as explained in http://docs.theysay.apiary.io/#reference/0/resources-topics/uploading-a-weighted-topic-expression

  2. Create a CSV with our topic rules and use the preceive-api.jar to upload these rules:-
    java -jar "PATH/TO/PRECEIVE/preceive-api.jar" -user '{USERNAME}' -pwd '{PASSWORD}' -set-topic-keywords /path/to/TopicDetectionRules.csv

We decide we want to pick out ART and IMMIGRATION from these texts, so we use the following topic rules:-

Label         Keywords          Weighting
Immigration   asylum seeker     10
Immigration   asylum seekers    10
Immigration   immigration       10
Art           art               10
Art           painter           10
Art           painting          10
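
If you would rather generate the rules file than type it by hand, the following is a minimal Python sketch. It assumes the uploader accepts a comma-separated file with a header row and the columns in the order Label, Keywords, Weighting; the exact layout expected by preceive-api.jar is not confirmed here, so check it against the documentation before uploading. The function name write_topic_rules and the output filename are illustrative only.

```python
import csv

# Topic rules copied from the table above: (label, keyword, weighting).
TOPIC_RULES = [
    ("Immigration", "asylum seeker", 10),
    ("Immigration", "asylum seekers", 10),
    ("Immigration", "immigration", 10),
    ("Art", "art", 10),
    ("Art", "painter", 10),
    ("Art", "painting", 10),
]


def write_topic_rules(path, rules=TOPIC_RULES,
                      header=("Label", "Keywords", "Weighting")):
    """Write the rules as CSV rows.

    ASSUMPTION: a header row and this column order; adjust both if the
    uploader expects something different.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rules)


if __name__ == "__main__":
    write_topic_rules("TopicDetectionRules.csv")
```

Keeping the rules in a list like this makes it easy to add further keywords for a topic later and regenerate the file.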

When the “Topic Rules” have been uploaded, we can perform the Topic Detection and Sentiment Analysis with the PreCeive Batch Tool:-

/path/to/preceive-cli/bin/preceive-cli \
    -user {USERNAME} -password {PASSWORD} \
    -output 100-texts_results.json \
    -endpoints document.topics=/v1/topic document.sentiment=/v1/sentiment?level=document \
    -input-id "Row ID" -input-text "TEXT" \
    -copy-input-data-as "Original_Text" \
    -batch /path/to/100-texts.xlsx

In the resulting output file, 100-texts_results.json, each text in which the corresponding keywords are found is labeled Art or Immigration, with a “confidence” calculated from the keyword weights, e.g.:-

"document": {
    "topics": {
        "scores": [{"label":"Immigration","confidence":0.195,"classifier":"CUSTOM"}]
    }
}
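
To pull the labels out of a result, something like the following Python sketch may help. It relies only on the "document" / "topics" / "scores" nesting shown in the fragment above. How the records are arranged within 100-texts_results.json, and where the Row ID and Original_Text fields appear, is not shown here, so the function takes a single already-parsed record. The name topic_labels and the min_confidence parameter are illustrative only.

```python
import json


def topic_labels(record, min_confidence=0.0):
    """Return (label, confidence) pairs from one result record.

    Missing keys are treated as "no topics found" rather than an error.
    """
    scores = record.get("document", {}).get("topics", {}).get("scores", [])
    return [
        (score["label"], score["confidence"])
        for score in scores
        if score["confidence"] >= min_confidence
    ]


# The fragment shown above, wrapped in braces so it parses as JSON.
sample = json.loads(
    '{"document": {"topics": {"scores": '
    '[{"label": "Immigration", "confidence": 0.195, "classifier": "CUSTOM"}]}}}'
)

print(topic_labels(sample))  # [('Immigration', 0.195)]
```

Raising min_confidence is one way to drop weak matches, though a sensible cut-off depends on your keyword weights and texts.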
Andy Pritchard
