Sections in this category

Transforming a Column: Keyword Analysis

  • Updated

Sisu allows you to transform Column in your data to analyze results based on keywords within your data.

 

Related Articles:

 

Using Keyword Analysis in Columns

Using keyword phrases is an advanced feature that can be used to parse through text data or get around difficult aggregations in many-to-one relationships. This feature operates by going through all of the text data provided, breaking the text apart by the supplied delimiter, and then treating each unique string (n-gram) as its own factor.

To analyze for keywords in a Column:

  1. Open any Analysis, and click Manage Columns and select a string field.

    Select_string_column.png

  2. Toggle on the Split into keyword phrases option.

    Split_into_keyword_phrases.png

You will be presented with the following options:

Token Delimiter

The character that defines what separates the values that need to be parsed. In raw text data, this would be a space (" "). In aggregated lists, this would be whatever the defined delimiter was. The 

Max Words per Phrase

Defines the number of values/words/n-grams that will be combined into a single Sisu factor.


Example:
"The quick cat ran" broken into 3 max words would produce "The quick cat" and "quick cat ran", as well as all 2-word and 1-word values.

Normalize text
*Only applicable if the delimiter is a space. 

Only available if the delimiter is a space.
Check this box to change all words to lowercase and remove apostrophes.

Example:
"You’ll love this movie!" will be converted to "youll love this movie!".

Remove punctuation
*Only applicable if the delimiter is a space. 

Only available if the delimiter is a space.
Check this box to automatically remove all punctuation such as periods, commas, etc.

Example:
"You’ll love this movie!" will be converted to "You’ll love this movie".

Remove common
words
*Only applicable if the delimiter is a space. 

Check this box to remove the most common words to improve results accuracy — it can only be used with token delimiter set to “space” and normalize text turned on.

Example:
"I did not enjoy this movie" will be converted to "not enjoy movie".

 

There are two situations that keyword analysis is designed for:

  • Long text: Use keyword analysis to parse through raw text to determine the most impactful n-grams.

    For example, if you have a dataset that contains survey comments from users with issues about Product A, using keywords can determine that the text combination "lid snapped" is particularly impactful on low CSAT scores.

  • Many-to-one aggregations:  When dealing with data of different granularity, it is difficult to aggregate string data without losing analytic capability. Keyword analysis helps this problem by retaining all of the lower granularity data in the form of a list.

    The below dataset is an example of using LISTAGG to retain all the hypothetical shows that a user watched even though the dataset is at a higher grain. In this case, when you want to aggregate data from an order level to a user level, typically you can do AVG, SUM, MEDIAN, MAX, MIN as aggregate methods for a numerical column (like view count, or average session time, for a user). However, for a string, you cannot aggregate that easily -- so we use LISTAGG to retain the richness of the data by concatenating them, separated by a delimiter.

    Keyword analysis can be used to parse through all the lower grain data and determine if a specific lower grain factor is impacting the metric.

    info_icon.png

    LISTAGG is a function that transforms values from a group of rows into a list of values that are delimited by a configurable separator. This function might be different for each data warehouse, if your data is connected to an external warehouse. LISTAGG works for Amazon Redshift, Snowflake, and Google Big Query.

    For more info, refer to the following:

    LISTAGG is typically used to denormalize rows into a string of comma-separated values (CSV) or other comparable formats suitable for human reading.

List_of_shows_watched.png

 

Long Text Keyword Analysis 

Using a space as the token delimiter is useful when you want to analyze columns with lots of text, such as a comment field. This analysis requires higher maximum words per phase, which can be very useful in parsing through phrases within long sentences or paragraphs. Here’s an example:

comments.png

tip_icon.png Note that using a space as the token delimiter gives you access to two more options:  remove punctuation and remove common words. It also allows you to look at a longer phrase such as “chocolate granola bars”. This helps you (and Sisu) focus on the more important aspects of the column’s content.

 

In this example, Sisu might return a 1st-order fact for all the times the word “awesome” appears in comments (looking at rows 1, 3, and 4), and another 1st-order fact for how many times the word “chocolate” appears (looking at rows 2 and 4). 

It might also return a 2nd-order fact for comments that contain both words (row 4).

tip_icon.png Note that in line 3, the word “Awesome” begins with a capital letter and in row 1 it is lower case. If you do not check the Normalize text box, Sisu will consider these separately and return separate facts for each.



Many-to-One Aggregation Keyword Analysis

The following describes a scenario using keyword analysis in a General Performance Analysis with the Metric “Average Order Value” for various SKUs. To begin, we will consider one-word factors only.

To back up a bit, keyword analysis enables a different type of factor. A “normal” factor might be: “gender=male” or “state=CA”. A keyword factor might be:  “list_ske contains ‘sku23’” or “list_flavor contains ‘chocolate’”.

In our example, the Split into keyword phrases settings use a commas as the token delimiter, and max and min words per phrase set as 1 (which is recommended to start), and our data looks like this:

example.png

 

In this example, without using the keywords option, Sisu would not recognize “sku23” (for example) as an individual option since it is “combined” with other SKUs in each row. In other words, Sisu would recognize the full string (“sku23,sku15,sku9,sku15,sku56”) as one long string. With the option on, Sisu looks into each row and finds each occurrence (split by a comma) of “sku23” and can then report on that SKU individually.

Let’s dig a little deeper. Considering that our max words are set at 1, Sisu considers the data this way:

1st_order_1_word.png

 

Sisu found each occurrence of “sku23” -- even in rows that contained additional SKUs -- and found that the Average Order Value for SKU 23 is $61. Sisu went on to do the same for all SKUs found in the data. In this case, Sisu looks at each SKU as “one word”.

Next, if the max words are set to 2, Sisu considers the data this way:

1st_order_2_words.png

 

Sisu looks at two SKUs next to each other, because order matters in 1st-order facts. For example in row 1 of the data, order matters. Sisu considers “sku23,sku15” as “one word”, and “sku15,sku9” and another, and “sku9,sku15” as another, and “sku15,sku56” as the last entity in the row. Sisu then applies the metric to each of these. 

tip_icon.png

Note that this is only useful, therefore, if the sequence of the data matters. For example, if you are analyzing the sequence of user actions, as in first the user clicked on this, then immediately afterward they clicked on that.

Another example where this would be helpful is if you are doing free text analysis on a long comment or long text. (Refer to the “Long Text Keyword Analysis” section above.) In this case you should set a higher max words per phrase in order to analyze those longer phrases.


Note that rows 1 and 4 each contain “sku23” and “sku15”. However, Sisu will only count the occurrence in row 1, because order matters in 1st-order facts. Sisu is looking at two SKUs at a time (because our setting is 2 words), but only when they are next to each other in the order. In our example, “sku15,sku9” appears in both rows 1 and 3, and will both be used in the AOV for that combination.

So far you’ve been looking at 1st-order facts, which are facts with a single factor, even if the factor contains a sequence of words. Now let’s consider 2nd-order facts, which are facts that use multiple factors to describe the fact. (Refer to Exploring Grouped Facts for more information.)

In this example, the 2nd-order fact is orders containing “sku23” AND “sku15”:

2nd_order_1_word.png

 

To calculate the AOV for facts containing these two items, in our data Sisu will look at rows 1 and 4, since both of those transactions contain those SKUs. 

tip_icon.png It is important to note that the order in which the SKUs appear within each transaction does not matter in 2nd-order facts. This is in contrast to the example of 1st-order fact or a 2-word analysis, in which the order does matter in the analysis.