wissel.net

Usability - Productivity - Business - The web - Singapore & Twins

Download Connect 2014 presentation files


The show is over and the annual question arises: how do I download all the presentations? To do that you will need a valid username and password for the Connect 2014 site; there is no anonymous access here. The 2014 site is built on IBM Portal and IBM Connections. IBM Connections has an Atom REST API, which opens interesting possibilities. With a few steps you can get your hands on all the files. I will use curl to do this.
  1. Create or edit your .netrc file to add your Connect 2014 credentials (in one line)
    machine connections.connect2014.com login [YourNumericID] password [YourNumericPassword] (Note [ and ] are NOT part of the line in the .netrc file)
  2. Download the feed. Checking this morning, I found a little more than 500 files. The Connections API allows a maximum of 500 entries per "page", so two calls are sufficient for now. You can check the number of files in the <snx:rank> element of the resulting XML:
    curl --netrc -G --basic -L 'https://connections.connect2014.com/files/basic/anonymous/api/documents/feed?sK=created&sO=dsc&visibility=public&page=1&ps=500' > page1.xml
    curl --netrc -G --basic -L 'https://connections.connect2014.com/files/basic/anonymous/api/documents/feed?sK=created&sO=dsc&visibility=public&page=2&ps=500' > page2.xml
    (explanation of parameters below)
  3. Transform the resulting files into a shell script using XSLT (see below): java -cp saxon9he.jar net.sf.saxon.Transform -t -s:page1.xml -xsl:connect2014.xslt -o:page1.sh
  4. Make the script executable (unless your OS executes arbitrary files anyway): chmod +x page1.sh
  5. Run the download: ./page1.sh
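Step 2 mentions the <snx:rank> element; you don't need a full XML parser to peek at it. A minimal sketch with sed, run here against an inlined feed fragment (the value 512 is made up; in practice you would pipe in your downloaded page1.xml):

```shell
# Stand-in for a downloaded feed (in practice: cat page1.xml)
feed='<feed xmlns:snx="http://www.ibm.com/xmlns/prod/sn"><snx:rank>512</snx:rank></feed>'

# Pull the numeric value out of the <snx:rank> element
count=$(printf '%s' "$feed" | sed -n 's|.*<snx:rank[^>]*>\([0-9]*\)</snx:rank>.*|\1|p')
echo "Files in feed: $count"
# → Files in feed: 512
```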
You are dealing with 2 sets of parameters here:
  • the CURL parameters
    • --netrc: pull the user name and password from the .netrc file
    • -G: perform a GET operation
    • --basic: use basic authentication
    • -L: follow redirects (probably not needed here)
    • (optional) -v: verbose output
  • the Connections Files API parameters
    • sK=created: sort by creation date
    • sO=dsc: sort descending
    • visibility=public: show all public files
    • page=1|2: which page to fetch (where a page starts depends on the page size)
    • ps=500: show 500 files per page (that's the maximum Connections supports)
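The two feed URLs from step 2 differ only in the page parameter, so they can be generated instead of typed; a small sketch (the helper name feed_url is my own, not part of any API):

```shell
# Base endpoint of the Connections Files API on the Connect 2014 site
base='https://connections.connect2014.com/files/basic/anonymous/api/documents/feed'

# Compose the feed URL for a given page, using the parameters explained above
feed_url() {
  printf '%s?sK=created&sO=dsc&visibility=public&page=%s&ps=500\n' "$base" "$1"
}

feed_url 1   # URL for the first 500 files
feed_url 2   # URL for the next batch
```

Feeding the output to curl --netrc -G --basic -L reproduces the calls from step 2.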
As usual: YMMV
The XSLT used:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"  xmlns:snx="http://www.ibm.com/xmlns/prod/sn"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="no" method="text" />
    <xsl:template match="/">#!/bin/bash
# Entries in this feed <xsl:value-of select="atom:feed/snx:rank" />
        echo "Starting downloads"
        <xsl:apply-templates select="atom:feed/atom:entry/atom:link[@rel='enclosure']" />
    </xsl:template>
    <xsl:template match="atom:link">curl --netrc -G --basic -C - -L "<xsl:value-of select="@href"></xsl:value-of>"    -o "<xsl:value-of select="@title" />"    
</xsl:template>
</xsl:stylesheet>
Of course, you could simply scout Slideshare instead.

Update: Stevan Bajić tuned the XSLT to check for the existence of the file with the right size. This makes it a Linux-native script. Enjoy! The following script is © 2014 Stevan Bajić

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:snx="http://www.ibm.com/xmlns/prod/sn" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output indent="no" method="text" />
	<xsl:template match="/">#!/bin/bash
# Entries in this feed <xsl:value-of select="atom:feed/snx:rank" />
		echo "Starting downloads"
		<xsl:apply-templates select="atom:feed/atom:entry/atom:link[@rel='enclosure']" />
	</xsl:template>
	<xsl:template match="atom:link">[ "$(stat "./files/<xsl:value-of select="@title" />" 2>&1 | sed -n "s:^[\t ]*Size\:[\t ]*\([0-9]*\)[\t ].*:\1:gp")" != "<xsl:value-of select="@length" />" ] && curl --netrc -G --basic -L "<xsl:value-of select="@href"></xsl:value-of>"    -o "./files/<xsl:value-of select="@title" />"
</xsl:template>
</xsl:stylesheet>
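The guard in Stevan's template compares the local file's size (parsed from GNU stat output) against the @length attribute of the feed entry and only runs curl on a mismatch. The same check can be tried in isolation; a sketch using the more portable wc -c, with a made-up file name and length:

```shell
# Demo stand-in for a previously downloaded file: "hello world" + newline = 12 bytes
printf 'hello world\n' > demo.pdf

length=12    # in the real script this value comes from the entry's @length attribute

# Compare local size against the expected length; only a mismatch would trigger a download
actual=$(wc -c < demo.pdf | tr -d '[:space:]')
if [ "$actual" = "$length" ]; then
  echo "size matches - skipping download"
else
  echo "size mismatch - would re-download"
fi
# → size matches - skipping download

rm demo.pdf
```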

Posted by on 04 February 2014 | categories: IBM

Comments

  1. posted by Stevan Bajić on Tuesday 04 February 2014 AD:
    Hallo Stefan,

    from my own experience in the past I know that IBM will add more and more files each time you run the script. So I would suggest adding something like this to the XSLT before the curl execution:

    [ -f "<xsl:value-of select="@title" />" ] ||

    That will avoid downloading files that you already downloaded.


    Cheers,

    Stevan
  2. posted by Stevan Bajić on Wednesday 05 February 2014 AD:
    The problem with the zip file is that it does not contain all the available files. It does at the time they post the zip file, but material that presenters publish after that date will not make it into the zip.
  3. posted by Stephan H. Wissel on Wednesday 05 February 2014 AD:
    @Stevan, yes, modifying the script to check for existing files makes a lot of sense. I used -C - as a parameter for curl. This still checks the online file, so it is less efficient than the -f check you suggest, but it covers the case where the transfer was disrupted and the file is only partial.
  4. posted by Cristian on Wednesday 05 February 2014 AD:
    Hi,
    anybody willing to share the slides with me after the script successfully runs? :-)

    Bye
    Cristian

    PS
    Unfortunately I did not join Connect 2014


  5. posted by Stevan Bajić on Wednesday 05 February 2014 AD:
    @Stephan, in such cases I usually call curl with -I and then parse the output to capture the content size, compare it with the local file, and if the sizes do not match I download the file again. Of course the -C parameter works and resumes the file.

    I changed the XSLT to do this header parsing and size comparison. The resulting shell script looks horribly complicated, but it does what I want it to do. Now I can run it all in a cron job and wait until everything is downloaded.
  6. posted by Chris Miller on Wednesday 05 February 2014 AD:
    For those that didn't go: I run the annual database of public files too

    { Link }

    They also put out the ZIP eventually of all sessions and this year it is all going to SocialBizUG (they say)
  7. posted by Stevan Bajić on Thursday 06 February 2014 AD:
    @Stephan, btw: the server does not support resuming:
    curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.

    So it is better to download the whole file again.

    The file size is already in the XML so no need to call HEAD against the file. This should speed up processing.
  8. posted by Sriram on Friday 07 February 2014 AD:
    Is using IBM Connections Plugin for Windows to copy files folder by folder a good option?!
  9. posted by Stephan H. Wissel on Friday 07 February 2014 AD:
    @Sriram: You have to move the files into a folder first since the plug-in doesn't show public files, so NO
  10. posted by Vitor Pereira on Tuesday 11 February 2014 AD:
    You don't have to move any files into any folders. Just add the "Portal Admin" user in the plugin, after that it's just copy/paste.