Mining the Social Web - University of Idaho [PDF]



www.it-ebooks.info


Learn how to turn > The Rock (1996) ... ...

These bits of meta /> %s: %s

' % \ (p['image']['url'], p['id'], p['displayName'])] HTML(''.join(html))

4.2. Exploring the Google+ API


Sample results are displayed in Figure 4-2 and provide the “quick fix” that we’re looking for in our search for the particular Tim O’Reilly of O’Reilly Media.

Figure 4-2. Rendering Google+ avatars as images allows you to quickly scan the search results to disambiguate the person you are looking for

Although there’s a multiplicity of things we could do with the People API, our focus in this chapter is on an analysis of the textual content in accounts, so let’s turn our attention to the task of retrieving activities associated with this account. As you’re about to find out, Google+ activities are the linchpin of Google+ content, containing a variety of rich content associated with the account and providing logical pivots to other platform objects such as comments. To get some activities, we’ll need to tweak the design pattern we applied for searching for people, as illustrated in Example 4-3.

Example 4-3. Fetching recent activities for a particular Google+ user

    import httplib2
    import json
    import apiclient.discovery

    USER_ID = '107033731246200681024' # Tim O'Reilly


Chapter 4: Mining Google+: Computing Document Similarity, Extracting Collocations, and More


    # XXX: Re-enter your API_KEY from https://code.google.com/apis/console
    # if not currently set
    # API_KEY = ''

    service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                        developerKey=API_KEY)

    activity_feed = service.activities().list(
        userId=USER_ID,
        collection='public',
        maxResults='100' # Max allowed per API
    ).execute()

    print(json.dumps(activity_feed, indent=1))

Sample results for the first item in the results (activity_feed['items'][0]) follow and illustrate the basic nature of a Google+ activity:

    {
      "kind": "plus#activity",
      "provider": {
        "title": "Google+"
      },
      "title": "This is the best piece about privacy that I've read in a ...",
      "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
      "object": {
        "resharers": {
          "totalItems": 191,
          "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvy..."
        },
        "attachments": [
          {
            "content": "Many governments (including our own, here in the US) ...",
            "url": "http://www.zdziarski.com/blog/?p=2155",
            "displayName": "On Expectation of Privacy | Jonathan Zdziarski's Domain",
            "objectType": "article"
          }
        ],
        "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
        "content": "This is the best piece about privacy that I've read ...",
        "plusoners": {
          "totalItems": 356,
          "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvyid..."
        },
        "replies": {
          "totalItems": 48,
          "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvyid..."
        },
        "objectType": "note"
      },
      "updated": "2013-04-25T14:46:16.908Z",
      "actor": {


        "url": "https://plus.google.com/107033731246200681024",
        "image": {
          "url": "https://lh4.googleusercontent.com/-J8nmMwIhpiA/AAAAAAAAAAI/A..."
        },
        "displayName": "Tim O'Reilly",
        "id": "107033731246200681024"
      },
      "access": {
        "items": [
          {
            "type": "public"
          }
        ],
        "kind": "plus#acl",
        "description": "Public"
      },
      "verb": "post",
      "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/d-ppAzuVZpXrW_YeLXc5ctstsCM\"",
      "published": "2013-04-25T14:46:16.908Z",
      "id": "z125xvyidpqjdtol423gcxizetybvpydh"
    }
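Given an activity shaped like the sample above, the actor, verb, and object type can be read straight out of the dictionary, along with the engagement counts. The following is a minimal sketch rather than code from the book: the function name summarize_activity is hypothetical, the sample dictionary is a trimmed-down stand-in for the real response, and the &#39; entity in its content string is an assumption about how the API encodes apostrophes. The regular expression is a deliberately crude way to drop tags, adequate only for simple markup.

```python
import html
import re


def summarize_activity(activity):
    """Return ((actor, verb, object_type), plain_text, counts) for one activity dict."""
    obj = activity.get("object", {})
    triple = (
        activity.get("actor", {}).get("displayName"),
        activity.get("verb"),
        obj.get("objectType"),
    )
    # Replace tags with spaces, decode entities such as &#39;,
    # then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", obj.get("content", ""))
    text = " ".join(html.unescape(text).split())
    # Missing counters default to 0 rather than raising KeyError.
    counts = {
        key: obj.get(key, {}).get("totalItems", 0)
        for key in ("resharers", "plusoners", "replies")
    }
    return triple, text, counts


# Hypothetical, trimmed-down activity modeled on the sample output.
sample = {
    "verb": "post",
    "actor": {"displayName": "Tim O'Reilly"},
    "object": {
        "objectType": "note",
        "content": "This is the best piece about privacy that I&#39;ve read ...",
        "resharers": {"totalItems": 191},
        "plusoners": {"totalItems": 356},
        "replies": {"totalItems": 48},
    },
}

triple, text, counts = summarize_activity(sample)
print(triple)
print(text)
print(counts)
```

Using .get() with defaults throughout keeps the helper from failing on activities that lack one of these fields.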

Each activity object follows a three-tuple pattern of the form (actor, verb, object). In this post, the tuple (Tim O’Reilly, post, note) tells us that this particular item in the results is a note, which is essentially just a status update with some textual content. A closer look at the result reveals that the content is something that Tim O’Reilly feels strongly about, as indicated by the title “This is the best piece about privacy that I’ve read in a long time!”, and hints that the note is active, as evidenced by the number of reshares and comments.

If you reviewed the output carefully, you may have noticed that the content field for the activity contains HTML markup, as evidenced by the HTML entity (&#39;) that appears in I&#39;ve. In general, you should assume that the textual

content="text/html; charset=UTF-8"/> %s """ blog_

    import json
    from bson import json_util # Comes with pymongo

    # Assumes db is a pymongo database handle opened earlier in the example

    # Get the collection stats (collstats) on a collection
    # named "mbox"
    print(json.dumps(db.command("collstats", "mbox"), indent=1))

    # Use the db.command method to issue a "text" command
    # on collection "mbox" with parameters, remembering that
    # we need to use json_util to handle serialization of our JSON
    print(json.dumps(db.command("text", "mbox", search="raptor", limit=1),
                     indent=1, default=json_util.default))
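The reason a default handler is needed when serializing query results is that json.dumps rejects any value it does not know how to encode, and dates are one such value. The sketch below shows the mechanism using only the standard library. The to_jsonable function is a hypothetical stand-in for json_util.default, not a reimplementation of it, and the ISO 8601 string it emits is a choice made here for illustration. The record itself is made up.

```python
import json
from datetime import datetime, timezone


def to_jsonable(value):
    """Fallback for json.dumps, called only for values it cannot encode itself."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError("Cannot serialize %r" % type(value))


# Hypothetical record with a date field, loosely modeled on a mail message.
record = {
    "Subject": "raptor",
    "Date": datetime(2001, 9, 21, 12, 25, 21, tzinfo=timezone.utc),
}

try:
    json.dumps(record)  # No default handler: the datetime is rejected
    failed_without_default = False
except TypeError:
    failed_without_default = True

encoded = json.dumps(record, indent=1, default=to_jsonable)
print(failed_without_default)
print(encoded)
```

The same pattern applies to any other type the encoder does not recognize: extend the fallback with another isinstance branch and raise TypeError for everything else.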

MongoDB’s full-text search capabilities are quite powerful, and you should review the text search documentation to appreciate what is possible. You can search for any term out of a list of terms, search for specific phrases, and prohibit the appearance of certain terms in search results. All fields are initially weighted the same, but it is also possible to weight fields differently to tune the ranking of the results that come back from a search. In our Enron corpus, for example, if you were searching for an email address, you might want to weight the To: and From: fields more heavily than the Cc: or Bcc: fields to improve the ranking of returned results. If you were searching for keywords, you might want to weight the appearance of terms in the subject of the message more heavily than their appearance in the content of the message.

In the context of Enron, raptors were financial devices that were used to hide hundreds of millions of dollars in debt from an accounting standpoint. Following are truncated sample query results for the infamous word raptor, produced by running a text query in the MongoDB shell:

    > db.mbox.runCommand("text", {"search" : "raptor"})
    {
      "queryDebugString" : "raptor||||||",
      "language" : "english",
      "results" : [
        {
          "score" : 2.0938471502590676,
          "obj" : {
            "_id" : ObjectId("51a983dfe391e8ff964c63a7"),
            "Content-Transfer-Encoding" : "7bit",
            "From" : "[email protected]",
            "X-Folder" : "\\SSHACKL (Non-Privileged)\\Shackleton, Sara\\Inbox",
            "Cc" : [
              "[email protected]"
            ],

6.3. Analyzing the Enron Corpus


            "X-bcc" : "",
            "X-Origin" : "Shackleton-S",
            "Bcc" : [
              "[email protected]"
            ],
            "X-cc" : "'[email protected]'",
            "To" : [
              "[email protected]",
              "[email protected]",
              "[email protected]",
              "[email protected]",
              "[email protected]"
            ],
            "parts" : [
              {
                "content" : "Maricela, attached is a draft of one of the...",
                "contentType" : "text/plain"
              }
            ],
            "X-FileName" : "SSHACKL (Non-Privileged).pst",
            "Mime-Version" : "1.0",
            "X-From" : "Ephross, Joel ",
            "Date" : ISODate("2001-09-21T12:25:21Z"),
            "X-To" : "Trevino, Maricela
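The idea of weighting fields differently, described earlier for subject versus message content, can be illustrated with a toy scorer in plain Python. This is a sketch of the concept only and is not how MongoDB computes its scores. In MongoDB itself, weights are supplied as an option when the text index is created. The weights, the field name body, the function name weighted_score, and the two sample documents are all hypothetical.

```python
import re

# Hypothetical weights: a hit in the subject counts five times as much
# as a hit in the message body.
FIELD_WEIGHTS = {"Subject": 5.0, "body": 1.0}


def weighted_score(doc, term, weights=FIELD_WEIGHTS):
    """Sum over fields of (occurrences of term in the field) * (field weight)."""
    term = term.lower()
    score = 0.0
    for field, weight in weights.items():
        # Lowercase and split into word tokens so matching is case-insensitive.
        tokens = re.findall(r"\w+", str(doc.get(field, "")).lower())
        score += weight * tokens.count(term)
    return score


# Made-up documents, not taken from the Enron corpus.
docs = [
    {"_id": 1, "Subject": "Raptor update", "body": "See attached."},
    {"_id": 2, "Subject": "Lunch", "body": "The raptor draft mentions raptor twice."},
]

ranked = sorted(docs, key=lambda d: weighted_score(d, "raptor"), reverse=True)
print([(d["_id"], weighted_score(d, "raptor")) for d in ranked])
```

With these weights, a single subject hit outranks two body hits, which is the kind of tuning the field weights are meant to allow.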
