Much like deep sea fishing, data trawling is more of a catch-all than a targeted pull of specific data points, but from that larger, less focused catch you can then drill down very specifically and effectively. For example, if you import data into a tool like Gephi you can potentially see interactions and comments from each member of a Facebook group or page. The identity of members may be obfuscated, but much can be learned about sets of people, such as mood, opinions, and who is most likely to post or share.
Recent events prompted me to look again into pulling data from Facebook, but here I’ve also looked at Twitter data sweeps. I wanted to note a few things about some apps around that can pull data from social networks, so below I list out my experiences and opinion of some freely available tools: Netvizz, NCapture (NVivo), the Twitter Archiver and F! DataMiner.
A good question to start with is, how do we get data off websites or social media? And, is it legal?
Social media data
NB Most third-party apps don’t need to ask for access to friend lists, but some do. If they do, the app can invariably see everything your friends post, just as you can in your own Facebook newsfeed (dependent on each friend’s own privacy settings).
NCapture is NVivo’s Chrome plug-in that can ‘pull’ data from any web source for import into NVivo software for qualitative data analysis. NVivo can analyse textual data in sophisticated terms, and can be fairly automated for large datasets. NCapture either scrapes pages or pulls social media content. I was especially interested in the latter.
I went to my Facebook and decided to try to pull the same data I had successfully pulled using Netvizz (see below). NCapture will not permit access to any Facebook Page or Group data unless you are a page admin. The NCapture screenshots included here very clearly indicate how users are required to accept access to their data when a connection is being made between an app and the Facebook API.
NCapture seemed fairly limited in its access to groups and pages on Facebook, and in fact would not capture any data, even from public Facebook groups, unless I was a registered admin. This implied it was either not going to be very useful, or it would need full access to my friends (which I had denied it) to enable gathering wider sets of data. I need to test more, but when I scraped my own website, all I got was a PDF of a single webpage screenshot, so that was pretty disappointing.
Netvizz pulls data from public groups or pages, within limitations placed by the Facebook API. It generates ‘GDF’ files that are meant to be opened in a network data tool like Gephi (or VOSviewer, for example). Netvizz does not ask for any permission to access your friends; it can already see them because it is itself a Facebook app (not made by Facebook, but built to exist in the Facebook ecosystem). It can see any dataset in Facebook that the API exposes, so Netvizz can pull data from all kinds of sources. It works within the legal limits of the Facebook API, so there ARE tight limits, and these are clearly indicated on each type of search and retrieve data trawl. It’s not that sophisticated, but it returns interesting insight. I just conducted a quick ‘page as network’ search of the following three pages. This shows you who else they all connect to, plus some associated data.
- https://www.facebook.com/brexituk/ last visible post Oct 17th 2017, 22k
- https://www.facebook.com/BrexitCentral/ last post January 2018, 15k members
- https://www.facebook.com/GetBritainOut/ last post yesterday, 240k members
Here’s what I got:
The Netvizz screenshot above shows that with the publicly available ID of a page or group, you can pull anonymised data from any public page or group, or from several at once. This data includes number of users, interactions, etc. It does not contain identifiable personal data.
A few fascinating things emerge from this quick page-as-network trawl. You can immediately see how few of the pages allow user posts, so it’s a top-down situation. You can also see that leavers connect to ‘British Weights and Measures’, Commonwealth and Royal Print: an old, traditional view of Britain, in other words. It’s also quickly noticeable how little posting is going on, though activity seems fairly high for some of the large fan count pages. Of course not all these pages are exclusively Brexit related, for example the Institute of Economic Affairs. But at a ‘level 1 depth’ (all that is permitted if you query more than one page), the 20 pages that have come up are very Brexit campaign orientated.
Here’s a Gephi visualisation of the connections. I don’t know much yet about how to use Gephi, but I know enough to show who is connecting to whom, and the direction of the connection. It’s clear from the visualisation that Get Britain Out is doing most of the running, with Better Off Out second. Brexit Central was the third page searched but is only a passive player, with only one directional connection to Vote Leave. Brexit, the second page searched (brexituk), connects to Royal Print and Perfect Signs Edinburgh but isn’t connected to anything else.
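If you would rather check the "who connects to whom" reading outside Gephi, the same idea can be sketched in a few lines of Python. This is only a rough illustration, assuming the GDF export follows the usual layout of a `nodedef>` header with node rows followed by an `edgedef>` header with edge rows. The page names and edges in `SAMPLE` are invented for the example, not taken from my Netvizz trawl, and a real export will carry many more columns.

```python
import csv
from collections import Counter


def parse_gdf(text):
    """Split GDF text into node rows and edge rows (dicts keyed by column name)."""
    nodes, edges = [], []
    columns, target = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("nodedef>") or line.startswith("edgedef>"):
            header = line.split(">", 1)[1]
            # Column declarations look like "name VARCHAR"; keep only the name.
            columns = [col.strip().split(" ")[0] for col in header.split(",")]
            target = nodes if line.startswith("nodedef>") else edges
            continue
        if target is None:
            continue  # ignore anything before the first header
        values = next(csv.reader([line]))
        target.append(dict(zip(columns, values)))
    return nodes, edges


def degree_counts(edges):
    """Count outgoing and incoming connections per node id."""
    out_deg = Counter(edge["node1"] for edge in edges)
    in_deg = Counter(edge["node2"] for edge in edges)
    return out_deg, in_deg


# Invented sample data, standing in for a real export.
SAMPLE = """nodedef>name VARCHAR,label VARCHAR
1,Page A
2,Page B
3,Page C
edgedef>node1 VARCHAR,node2 VARCHAR,directed BOOLEAN
1,2,true
1,3,true
2,3,true
"""

nodes, edges = parse_gdf(SAMPLE)
out_deg, in_deg = degree_counts(edges)
for node in nodes:
    print(node["label"], "out:", out_deg[node["name"]], "in:", in_deg[node["name"]])
```

A page with a high out count is the one "doing most of the running"; a page with only incoming connections is a passive player in the sense used above.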
Caveat: this was not a scientifically robust analysis!
The Twitter Archiver
This is a really interesting and useful add-on for Google Sheets. If you are familiar with running scripts in Google Sheets you’ll know that all sorts of cool stuff can be made to happen, making spreadsheets very useful and clever tools for data analysis.
A while back (last year) I ran a test, searching for some Brexit-related keywords. The results I got back were very interesting and provided a surprising level of detail: real names, profiles and locations, as well as tweet text. This data can be further analysed using tools like NVivo for sentiment keywords. This is quite powerful, and quite easy to set up.
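To give a flavour of what a first pass over that kind of sheet might look like, here is a minimal Python sketch that counts how many tweets mention each keyword. It is plain keyword counting, not real sentiment analysis and not what NVivo does. The column names (`Screen Name`, `Tweet Text`), the keyword list and the three sample tweets are all made up for illustration, so adjust them to whatever your own export contains.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical keywords to look for; swap in your own search terms.
KEYWORDS = {"leave", "remain", "sovereignty"}


def keyword_counts(rows, text_column, keywords):
    """Count how many rows mention each keyword at least once."""
    counts = Counter()
    for row in rows:
        words = set(re.findall(r"[a-z']+", row[text_column].lower()))
        counts.update(words & keywords)
    return counts


# Invented sample standing in for a sheet downloaded as CSV.
SAMPLE_CSV = """Screen Name,Tweet Text
user_a,Time to leave and take back sovereignty
user_b,I voted remain
user_c,"Leave means leave, apparently"
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = keyword_counts(rows, "Tweet Text", KEYWORDS)
for word, total in counts.most_common():
    print(word, total)
```

Each tweet is counted once per keyword however often the word is repeated, which keeps one shouty tweet from skewing the totals.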
More tools for Twitter sentiment analysis are discussed here.
F! Data Miner
Sadly I could not get this to work at all. I finally gave up after trying all the usual fixes, as I don’t have all day to get what should be a very straightforward tool to work. Technical problems included repeated requests for new passwords and non-recognition of email login info, and even when I was logged in, the Chrome extension simply did not pull any data. But their claims about data retrieval were eye-opening. Note the clear emphasis on privacy, legality and that no one else holds your searches (they are not stored in the cloud, only on a local machine). Also note the claim on the extension interface page that you can scrape a potentially limitless number of posts and users.
These are not the only tools out there, and I intend over time to test quite a few others. For clarity, it should be pointed out that NVivo is professional-level qualitative data analysis software, and not to be confused with free software or easy-to-obtain plug-ins. It’s licence-based, commercial software. Comparable packages include ATLAS.ti, QDA Miner and MAXQDA, among others. A helpful article is available at predictiveanalyticstoday.
I would also say that I am a novice at data analysis, but perhaps that’s the point. Even a novice with some reasonable technical acumen can pull data off social media (or other data sources) and analyse it to quite sophisticated levels.