Social Media Country of Origin Analysis – 82% Accurate at Separating Canadian from American Posts

by David Johnson on December 28, 2011

Decooda Country of Origin Analysis for “Big Co.” – Identifying Canadian/American Social Media Content


We were recently asked if we could perform a social media analysis to identify “Canadian” origin posts for a Fortune 250 consumer brand company. The client already has a contract with one of the largest social media analytics solution providers, but that provider was not able to perform the analysis. The client specifically asked us to do a proof-of-concept to see whether it was feasible to identify Canadian posts with a high degree of accuracy in a co-mingled North American dataset.

We were provided with an export of data from the current social media provider’s system.  We cleaned and prepared the data so we could perform the analysis.


The goal of the assessment was to establish the degree of accuracy with which Canadian social media posts could be distinguished from American social media posts.


Three forms of results are provided: recall, precision, and their harmonic mean. Recall is the proportion of the posts that truly belong to a category that the model found, regardless of the number of false identifications that were made. Precision is the proportion of the posts the model assigned to a category that truly belong to it, and thus considers true positives and false positives rather than the number of actual posts that exist. The harmonic mean of the two (commonly called the F1 score) balances recall and precision and can be read as a single summary of how well the model performs on a category.
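To make the three measures concrete, here is a minimal sketch of how they are computed from raw counts for one category. The function name and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the client data.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, harmonic mean) for one category.

    true_pos:  posts labeled as the category that really belong to it
    false_pos: posts labeled as the category that do not belong to it
    false_neg: posts that belong to the category but were not labeled as such
    """
    predicted = true_pos + false_pos   # everything the model labeled
    actual = true_pos + false_neg      # everything that truly belongs
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
print(precision, recall, f1)
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot score well on it by maximizing recall at the expense of precision, or the reverse.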

For the American dataset, the harmonic mean was .88 (which can be read loosely as 88% accuracy). Recall was .85 and precision was .90, suggesting a good balance between finding American posts and labeling them correctly. For the Canadian dataset, the harmonic mean was .76 (which can be read loosely as 76% accuracy). Recall was .80 and precision was .73, again suggesting a good balance between finding Canadian posts and labeling them correctly. Higher American accuracy is to be expected because Canadian English can be treated as a variant of American English rather than the other way around. As such, many Canadian texts will have every appearance of simply being American. Indeed, this issue is the very reason for the current appraisal. The findings can be summarized as suggesting that for every 100 documents analyzed, approximately 82 will be correctly categorized.
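The arithmetic behind these figures can be checked with a few lines of Python. This is a sketch under two assumptions: that the harmonic mean is the standard F1 formula, and that the overall figure of roughly 82 is the simple average of the two per-country harmonic means. The post does not state how the overall figure was derived.

```python
def harmonic_mean(precision, recall):
    # Standard harmonic mean of two values (the F1 score).
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the post (already rounded).
american_f1 = harmonic_mean(precision=0.90, recall=0.85)
canadian_f1 = harmonic_mean(precision=0.73, recall=0.80)

# Assumption: the overall figure is the mean of the two reported scores.
overall = (0.88 + 0.76) / 2

print(f"American: {american_f1:.3f}")
print(f"Canadian: {canadian_f1:.3f}")
print(f"Overall:  {overall:.2f}")
```

Recomputed from the rounded inputs, the American value comes out slightly below the reported .88, which is consistent with the original precision and recall having been rounded before publication. The Canadian value matches the reported .76.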

Discussion and Conclusions

The accuracy of the model is very high (approximately .82). In similar published analyses, a value of around .70 is considered good. Of the analysis techniques applied, sentiment analysis and discourse variation techniques proved to be the strongest. The weaker approaches should not be thought of as having failed; instead, they are likely to be useful in further analysis at a finer-grained level.

For each medium (e.g., blogs, tweets), a slightly different model will be applied, reflecting the distribution of data within the texts. Future models will be based upon the initial model described here and are likely to achieve similar accuracy.


The approaches used in this analysis establish that Canadian and American texts can be categorized with a high degree of accuracy. The approach is automated, so results can be expected quickly (including in real time), with an accuracy that is well above most published categorization findings.

We use a similar approach to perform ethnicity and other forms of demographic analysis, all with a high degree of accuracy.

If you have a unique need, or if you are not getting what you want from your existing social media vendor, please give us a call. We’d like to discuss how we might be able to augment your social media analysis workflow.

David Johnson



