How to improve the accuracy of my topics?
Accuracy is a much-discussed topic in the field of text analysis and one that tends to cause confusion.
Factors that affect the accuracy
It is important to keep in mind that the perception of topic accuracy is subjective and depends on the interpretation of the person looking at the results. If the same set of predictions were shown to different people, there is a chance that they would not all interpret them in exactly the same way.
For example, if we have a topic called "Feedback", is it clear what it is supposed to include? Is it feedback about the product features, customer service, or just mentions of the word "feedback"?
The more specific the topic definition, the easier it will be to determine whether the predictions belong to the topic. Whilst there is no absolute truth in whether a given prediction is accurate, if the topic is well defined, the majority of people will tend to agree on whether a given prediction belongs to the topic or not.
In this article, we will break down some of the key factors that impact accuracy and suggest some tips for improving the accuracy of your Prodsight Topics.
Recall represents the proportion of all relevant instances that were correctly included in the topic.
For example, if you are tracking mentions related to customer cancellations and there are five such mentions in your data but only three were included in the topic, the recall would be equal to 60% (3 out of 5).
Generally, you want the recall to be as high as possible so you are not missing out on any instances.
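The arithmetic in the cancellation example can be written out as a short Python sketch. The function name `recall` and the message IDs are made up for illustration and are not part of Prodsight.

```python
def recall(relevant, predicted):
    """Share of the truly relevant items that the topic actually captured."""
    relevant, predicted = set(relevant), set(predicted)
    if not relevant:
        return 0.0
    return len(relevant & predicted) / len(relevant)


# Hypothetical message IDs: five messages are really about cancellations,
# but the topic only picked up three of them.
cancellation_messages = {1, 2, 3, 4, 5}
topic_results = {1, 2, 3}

print(f"{recall(cancellation_messages, topic_results):.0%}")  # 60%
```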
Precision represents the proportion of predicted instances that are correct.
For example, if you are tracking mentions related to customer cancellations and the topic returns four results, two that refer to cancellations and two that talk about upgrades, the precision would be equal to 50% (2 out of 4).
Generally, you want the precision to be as high as possible so that you have to spend less time reading and discarding irrelevant results.
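The same kind of calculation applies to precision. The sketch below is illustrative only: the function name `precision` and the message IDs are assumptions, not anything Prodsight exposes.

```python
def precision(relevant, predicted):
    """Share of the topic's results that really belong to the topic."""
    relevant, predicted = set(relevant), set(predicted)
    if not predicted:
        return 0.0
    return len(relevant & predicted) / len(predicted)


# Hypothetical message IDs: the topic returned four results,
# but only two of them are actually about cancellations.
topic_results = {1, 2, 3, 4}
cancellation_messages = {1, 2}

print(f"{precision(cancellation_messages, topic_results):.0%}")  # 50%
```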
The balance between Precision and Recall
Whilst in an ideal world we would have 100% precision and 100% recall, this is not realistic. The more precise your results are, the higher the chance that some relevant instances will be missed by the model. If you increase the recall, it is likely that your results will include some predictions that do not belong there, hence lowering the precision.
That's why it's important to determine what is most important to you. Do you want to make sure that as many potential matches as possible are included even if some are inaccurate? Or would you rather maximize the chances that each result is as accurate as possible even if some potential matches will be excluded?
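To make the trade-off concrete, here is a minimal sketch that scores two keyword choices against a tiny hand-labelled sample. The messages, the labels, and the plain substring matching are all assumptions made for illustration; they do not describe how Prodsight matches keywords internally.

```python
def precision_recall(messages, keyword):
    """Score a keyword against (text, is_relevant) pairs using substring matching."""
    # Labels of every message the keyword would pull into the topic.
    predicted = [is_relevant for text, is_relevant in messages if keyword in text.lower()]
    true_positives = sum(predicted)
    total_relevant = sum(is_relevant for _, is_relevant in messages)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


# Hypothetical sample: True means the message is really about a cancellation.
messages = [
    ("I want to cancel my account", True),
    ("Please cancel account today", True),
    ("How do I cancel my subscription?", True),
    ("I had to cancel my meeting, can we reschedule the demo?", False),
    ("Upgrade my plan please", False),
]

for keyword in ("cancel", "cancel account"):
    p, r = precision_recall(messages, keyword)
    print(f"{keyword!r}: precision {p:.0%}, recall {r:.0%}")
```

On this sample, the shorter keyword finds every cancellation message but also lets in one irrelevant result, while the longer keyword returns only a relevant result but misses two of the three cancellation messages.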
How to improve topic accuracy?
Prodsight offers a range of options for managing your topic accuracy.
- You can improve recall by using keywords with fewer words. For example, instead of using "cancel account", you can use "cancel", which is less specific and will likely return more matches;
- You can improve precision by using more specific or multi-word keywords. For example, instead of using "transfer" consider using "bank transfer", "wire transfer" and "money transfer";
- You can improve precision by using exact-match keywords. For example, instead of using "refund" as a broad-match keyword, which will include variations such as "refunds" and "refunded", you can use "refund" as an exact-match keyword to only return results that match that spelling exactly;
- You can improve precision by manually removing matches that do not meet the topic criteria.
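The difference between broad-match and exact-match keywords can be sketched in a few lines of Python. This is only an approximation under stated assumptions: broad match is modelled here as "the keyword followed by any word ending", and the `matches` helper and sample messages are hypothetical. Prodsight's actual matching rules may differ.

```python
import re


def matches(text, keyword, exact=False):
    """Return True if the keyword occurs in the text.

    exact=True  -> the keyword must appear as a whole word, spelled exactly.
    exact=False -> the keyword may be followed by extra letters, e.g. "refunds"
                   (an assumed stand-in for broad matching).
    """
    suffix = r"\b" if exact else r"\w*"
    pattern = r"\b" + re.escape(keyword) + suffix
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical customer messages.
messages = [
    "I would like a refund",
    "Two refunds are still pending",
    "My order was refunded yesterday",
]

broad = [m for m in messages if matches(m, "refund")]
exact = [m for m in messages if matches(m, "refund", exact=True)]

print(len(broad))  # 3
print(exact)       # ['I would like a refund']
```

Broad matching captures all three messages, which favours recall; exact matching keeps only the first, which favours precision.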
User Intent-based topics
If you are using the User Intent criteria in your topics, we will automatically optimize for the best balance between recall and precision. However, you can improve the overall accuracy of the model by using our User Intent Training tool.