Importing files into a SharePoint document library using regular expressions and WebDAV

Posted on 5/26/2008

I just finished writing a utility to export a folder hierarchy of files from my existing custom extranet to a SharePoint document library. The custom extranet was database-driven and allowed the user to name a file or folder whatever he or she wished up to a maximum of 500 characters. When I wrote this extranet 6 years ago in classic ASP, I'd just HTML encode whatever name the user wished and store it in the database. Whenever a folder or file was retrieved, it was always by using the ever-so-not-user-friendly URL parameter "id=".

I already knew I would need to remove the restricted characters that SharePoint does not allow from my folder and file names. Furthermore, SharePoint's document libraries display the full folder path in the URL, which means I also need to be concerned about the total path length.

My migration plan was to build a physical folder hierarchy for staging the files, then use WebDAV (SharePoint's explorer view for document libraries) to import the hierarchy into SharePoint from within Windows. This approach keeps the utility focused on a simpler task than actually importing the files into SharePoint, and it means I don't have to worry about server timeouts.

Naming restrictions

SharePoint has naming restrictions for sites, groups, folders and files. Since I'm only interested in folders and files, only the following restrictions will be considered.

  • Invalid characters: \ / : * ? " ' < > | # { } % ~ &
  • You cannot use the period character consecutively in the middle of the name
  • You cannot use the period character as the first or the last character

Someone already familiar with this topic will notice that I added the apostrophe to the official restricted character list. During my own testing, SharePoint complained when I uploaded a file with an apostrophe, so I added it to the list.
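
As a rough illustration of the rules above, here is a minimal sketch of a validity check. The class and method names (SharePointNameCheck, IsValidName) are made up for this example, and the character list simply mirrors the one in this post, including the apostrophe I added myself.

```csharp
using System.Text.RegularExpressions;

public static class SharePointNameCheck
{
    // Invalid characters from the list above (apostrophe included).
    // Inside a character class only the backslash needs escaping.
    private static readonly Regex InvalidChars =
        new Regex(@"[\\/:*?""'<>|#{}%~&]");

    // Returns true when the name breaks none of the three rules listed above.
    public static bool IsValidName(string name)
    {
        if (string.IsNullOrEmpty(name)) return false;
        if (InvalidChars.IsMatch(name)) return false;       // restricted character
        if (name.Contains("..")) return false;               // consecutive periods
        if (name[0] == '.' || name[name.Length - 1] == '.')  // leading or trailing period
            return false;
        return true;
    }
}
```

A check like this is handy for reporting which names will change before any cleaning is done; it does not fix anything by itself.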

Length restrictions

Besides naming restrictions, SharePoint also has the following length restrictions (from KB 894630).

  • A file or a folder name cannot be longer than 128 characters.
  • The total URL length cannot be longer than 260 characters.

128 character limit for folders and files

Regarding the 128 character limit, you can't use SharePoint's UI to get to this limit. The text box's maxlength property is set to 123 for both folders and files. I don't have any inside sources, but my guess is that the SharePoint team did this to make sure the total file name would not exceed 128 characters if the extension was 4 characters (as is the case with Office 2007 file formats like docx and xlsx). The odd thing is that the folder text box is limited to 123 characters as well. However, if you put the document library into Explorer view, you can rename a folder to allow the full 128 characters. I bet there's some reuse going on between the data entry screens for the file and the folder in this case (also something a programmer on the SharePoint team might want to do).

260 character limit for URLs

I've done some WebDAV importing to this particular SharePoint farm in the past, and I'm pretty sure I ran into paths close to the 260 character limit, so I investigated this. I found several instances where the total URL exceeded 260 characters.

KB 894630 mentioned above also says:

To determine the length of a URL, .... convert the string in the URL to Universal Character Set (UCS) Transformation Format 16 (UTF-16) format, and then count the number of 16-bit characters in the string.

However, it should probably say something like "decode the URL first, then count the characters" to make it easier to understand. I created a folder hierarchy to test out the 260 character limit. Following is a URL (notice the %20 space codes) to a test file copied from the address bar of the browser. When the URL is encoded, it contains 346 characters.

http://intranet.xyzco.com/sites/Testing/Documents/A%20longer%20than%20usual%20folder%20name%20for%20testing/Subfolder%201%20also%20has%20a%20long%20name/3rd%20level%20subfolder%20about%20related%20documents/4th%20level%20subfolder%20about%20more%20specific%20documents/5th%20level%20subfolders%20are%20possible%20in%20this%20hierarchy/1234567.txt

The decoded URL is:

http://intranet.xyzco.com/sites/Testing/Documents/A longer than usual folder name for testing/Subfolder 1 also has a long name/3rd level subfolder about related documents/4th level subfolder about more specific documents/5th level subfolders are possible in this hierarchy/1234567.txt

Counting the characters in the URL gave me 284. To get closer to 260, I subtracted the 25 characters for the web application:

284 - 25 (length of http://intranet.xyzco.com) = 259 characters

I didn't get a perfect 260, but it's close enough for me to believe that the web application host header name is not included in the limit. This is just a guess on my part, though.
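
The "decode first, then count" idea can be sketched in a few lines. This is only an illustration of my reading of the KB article: the class and method names are made up, and leaving out the scheme and host is the guess described above, not documented behavior. String.Length in .NET counts UTF-16 code units, which lines up with the KB wording.

```csharp
using System;

public static class UrlLength
{
    // Length of the URL after percent-decoding, in UTF-16 code units.
    public static int DecodedLength(string url)
    {
        return Uri.UnescapeDataString(url).Length;
    }

    // Decoded length minus the scheme and host part
    // (for example "http://intranet.xyzco.com").
    public static int LengthWithoutHost(string url)
    {
        string hostPart = new Uri(url).GetLeftPart(UriPartial.Authority);
        return DecodedLength(url) - hostPart.Length;
    }
}
```

For the test URL above (in its encoded form, as one unbroken string), DecodedLength should reproduce the 284 I counted by hand and LengthWithoutHost the 259.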

Why the 260 character limit?

A 260 character limit on the URL is interesting, considering both Windows and most internet browsers support paths much longer. It's not merely a coincidence that 260 also just so happens to be the value of the infamous MAX_PATH constant from the Windows API. .NET uses MAX_PATH because .NET relies on the Windows API behind the scenes. There are API workarounds, as discussed on the BCL team blog, but I think it's safe to assume that this limit is imposed on SharePoint by .NET in some way.

Removing invalid characters and patterns using a regular expression

The String object's Replace method doesn't contain an overload for replacing an array of strings, so I looked into using a regular expression to clean folder and file names.

Regular expressions have their own special characters that must be escaped if used for searching:

[ \ ^ $ . | ? * + ( )

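Out of these, the following are also SharePoint's invalid characters: * ? | \. These are the characters I escaped in the regular expression. (Strictly speaking, inside a character class only the backslash has to be escaped, but escaping the others is harmless.)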

After a bit of fiddling, I came up with the following 4 expressions:

  1. [\*\?\|\\/:"'<>#{}%~&] for removing invalid characters
  2. \.{2,} for replacement of consecutive periods
  3. ^[\. ]|[\. ]$ for removing spaces and periods from the beginning and end of a folder or file name
  4. " {2,}" for replacement of consecutive spaces (enclosed by quotation marks so you can see the space)

I added a couple of rules to these expressions because of my migration strategy. Since I'm using WebDAV and building a physical folder hierarchy in Windows, I also need to be concerned about any additional restrictions imposed by the OS (a folder or file name can't end with a space). Also, I'm replacing consecutive spaces with a single space.

All expressions are applied with Regex.Replace(). Matches of expressions 1 and 3 are replaced by String.Empty; matches of 2 and 4 are replaced by a period and a space, respectively. As for the order of the replacements, it's important that the invalid character replacement is applied first. Combining the expressions and replacing everything in one pass can leave a violation behind once the invalid characters are gone. For example, the name %.afile.txt would become .afile.txt if done all at once, violating the rule that a period cannot be the first character.
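
To make the ordering problem concrete, here is a small sketch that compares a single combined pass with two passes in sequence. The class and method names are made up for this comparison, and only expressions 1 and 3 are included to keep it short.

```csharp
using System.Text.RegularExpressions;

public static class ReplacementOrder
{
    private static readonly Regex InvalidChars =
        new Regex(@"[\*\?\|\\/:""'<>#{}%~&]");
    private static readonly Regex StartEnd =
        new Regex(@"^[\. ]|[\. ]$");
    // Expressions 1 and 3 joined into a single alternation.
    private static readonly Regex Combined =
        new Regex(@"[\*\?\|\\/:""'<>#{}%~&]|^[\. ]|[\. ]$");

    // One pass: the ^ anchor is tested against the ORIGINAL string,
    // so a period that only becomes the first character later is missed.
    public static string AllAtOnce(string name)
    {
        return Combined.Replace(name, string.Empty);
    }

    // Two passes: invalid characters first, then the start/end rule.
    public static string InSequence(string name)
    {
        string cleaned = InvalidChars.Replace(name, string.Empty);
        return StartEnd.Replace(cleaned, string.Empty);
    }
}
```

With the input %.afile.txt, AllAtOnce returns .afile.txt while InSequence returns afile.txt.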

After all replacements have been made, it's still possible to have one of the rules violated. For example, a folder named "Folder one . and . " (ends with space, period, space) would still be invalid after 1 pass of expression 3. It would still be invalid after a 2nd pass. Because of this, the beginning and end rule should be used in a loop until no matches are found. This doesn't help performance, but I was willing to compromise since my largest extranet (9000 files and hundreds of folders) was processed within a minute. Plus, I know the minute I post this someone's going to read it and say, "What was he thinking? It's so much faster to do it this way...".
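
For anyone who wants to skip the loop, here are two alternatives that should give the same result in a single call. This is a sketch rather than what the utility actually used, and the class and method names are made up. The first widens expression 3 so that it matches whole runs of periods and spaces; the second uses String.Trim with an explicit character list.

```csharp
using System.Text.RegularExpressions;

public static class EdgeTrim
{
    // Expression 3 as written: one character per match, so it needs a loop.
    private static readonly Regex StartEndOnce = new Regex(@"^[\. ]|[\. ]$");
    // Same idea with a + quantifier: a whole run is removed in one pass.
    private static readonly Regex StartEndRuns = new Regex(@"^[. ]+|[. ]+$");

    public static string TrimWithLoop(string name)
    {
        while (StartEndOnce.IsMatch(name))
            name = StartEndOnce.Replace(name, string.Empty);
        return name;
    }

    public static string TrimWithRuns(string name)
    {
        return StartEndRuns.Replace(name, string.Empty);
    }

    // No regular expression at all: strips periods and spaces from both ends.
    public static string TrimWithChars(string name)
    {
        return name.Trim('.', ' ');
    }
}
```

All three turn "Folder one . and . " into "Folder one . and".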

Fixing length restrictions

To make sure you include as many characters from the original folder or file name as possible, the naming restrictions should be enforced before the length restrictions.

To know how long a file name can be, it's important to know how close we are to the maximum allowed path length. Since I'm using a physical file hierarchy to stage the files, I can simply check the current folder's path length. Instead of going into too much detail about this, take a look at the maxLength integer in the following code listing. maxLength is what I used to determine how long a folder or file could be given the current path length.
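
The length budget can be worked through on its own with a small sketch. The class name LengthBudget and the helper name GetMaxItemLength are only illustrative here, and the 259 ceiling is the value I settled on from the test earlier in this post.

```csharp
using System;

public static class LengthBudget
{
    public const int MaxUrlLength = 259;

    // currentPathLength: length of the parent path, including the trailing separator.
    // maxItemLength:     the per-item cap (123 for files, 128 for folders).
    public static int GetMaxItemLength(int currentPathLength, int maxItemLength)
    {
        // Use the per-item cap unless the remaining path budget is smaller.
        int maxLength = (currentPathLength + maxItemLength > MaxUrlLength)
            ? MaxUrlLength - currentPathLength
            : maxItemLength;

        if (maxLength <= 0)
            throw new ApplicationException(
                "Current path is too long for importing into SharePoint");

        return maxLength;
    }
}
```

For example, a parent path of 100 characters leaves the full 123 characters for a file name, while a parent path of 200 characters leaves only 59 for a folder name (259 - 200).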

An example method in C#

Following is the method I ended up with, along with the class-level fields it relies on. You'll notice I added the tab character to the invalid characters list. During an export, I found a file name with embedded tab characters, so it was added to the list as well.


// required at the top of the file:
// using System;
// using System.Text.RegularExpressions;

private const int MAXFOLDERLENGTH = 128, MAXFILELENGTH = 123;
private const int MAXURLLENGTH = 259;

// invalid SharePoint characters, plus the apostrophe and the tab character
private Regex invalidCharsRegex =
    new Regex(@"[\*\?\|\\\t/:""'<>#{}%~&]", RegexOptions.Compiled);

// two or more consecutive periods
private Regex invalidRulesRegex =
    new Regex(@"\.{2,}", RegexOptions.Compiled);

// a period or space at the very beginning or very end
private Regex startEndRegex =
    new Regex(@"^[\. ]|[\. ]$", RegexOptions.Compiled);

// two or more consecutive spaces
private Regex extraSpacesRegex =
    new Regex(" {2,}", RegexOptions.Compiled);

/// <summary>
/// Returns a folder or file name that
/// conforms to SharePoint's naming restrictions
/// </summary>
/// <param name="original">
/// The original file or folder name.
/// For files, this should be the file name without the extension.
/// </param>
/// <param name="currentPathLength">
/// The current folder's path length
/// </param>
/// <param name="maxItemLength">
/// The maximum allowed number of characters for this file or folder.
/// For a file, it will be MAXFILELENGTH.
/// For a folder, it will be MAXFOLDERLENGTH.
/// </param>
private string GetSharePointFriendlyName(string original,
    int currentPathLength, int maxItemLength)
{
    // remove invalid characters first, then collapse consecutive
    // periods into one period and consecutive spaces into one space
    string friendlyName = extraSpacesRegex.Replace(
        invalidRulesRegex.Replace(
            invalidCharsRegex.Replace(original, String.Empty).Trim(),
            "."),
        " ");

    // assign maximum item length: the item limit, or whatever
    // is left of the path budget if that is smaller
    int maxLength = (currentPathLength + maxItemLength > MAXURLLENGTH)
        ? MAXURLLENGTH - currentPathLength
        : maxItemLength;

    if (maxLength <= 0)
        throw new ApplicationException(
            "Current path is too long for importing into SharePoint");

    // return truncated name if length exceeds maximum
    // (the original listing used maxLength - 1 here, which cut one
    // character more than necessary)
    if (friendlyName.Length > maxLength)
        friendlyName = friendlyName.Substring(0, maxLength).Trim();

    // finally, check beginning and end for periods and spaces
    while (startEndRegex.IsMatch(friendlyName))
        friendlyName = startEndRegex.Replace(
            friendlyName, String.Empty);

    return friendlyName;
}

A typical call to this method would look similar to the following. In this listing, parent is a DirectoryInfo object pointing to the current folder.

fileName = GetSharePointFriendlyName(fileName,
    parent.FullName.Length + 1, MAXFILELENGTH);
folderName = GetSharePointFriendlyName(folderName,
    parent.FullName.Length + 1, MAXFOLDERLENGTH);

Testing the import to SharePoint using empty files

The best test would be to actually upload the files via WebDAV to a staging environment. However, if you receive an error message because of name restrictions or path length during the process, it's difficult to pick back up where the error occurred.

To quickly preview an upload, I modified my export utility to create empty files instead of building the folder hierarchy with the actual files. You can use these for a mock import in WebDAV even though SharePoint's UI will not allow you to upload an empty file. The following line was used to create the files.

using (StreamWriter sw = File.CreateText(fileName.ToString())) { };

The using statement makes sure the StreamWriter is closed after the file is created. I learned this the hard way when the OS threw an exception about a file being locked.
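
If you'd rather not create a StreamWriter at all, an empty file can also be produced by disposing the stream returned by File.Create. The following is a sketch of that variation, not what the utility used; the class and method names are made up, and it also creates any missing parent folders.

```csharp
using System.IO;

public static class PlaceholderFiles
{
    // Creates a zero-length file at the given path, creating
    // missing parent folders first.
    public static void CreateEmptyFile(string path)
    {
        string folder = Path.GetDirectoryName(path);
        if (!string.IsNullOrEmpty(folder))
            Directory.CreateDirectory(folder);

        // File.Create returns an open FileStream; the using block closes it
        // right away so the file is not left locked.
        using (File.Create(path)) { }
    }
}
```

Note that File.Create truncates an existing file, so point this only at the staging folder and never at the real source files.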

Another benefit of using empty files is to preview the migration for your users. They can browse the document library and offer their approval. Since we've had to remove some characters and possibly truncate names, this could be very important to the success of the migration.

Export Utility

Just to offer some eye candy for this post, I ended up with something that looked like this:

[Image: export utility screenshot]

12 comments:

  1. Since the original post, I've had a little more reflection time on my example method, and have some suggestions for making it better. The maxLength assignment and the ApplicationException should be removed from the example method to increase performance. These 2 lines do not require anything within the method in order to run successfully.

    After maxLength is calculated outside the method, it can be passed in using the maxItemLength parameter. This would also mean you no longer need the currentPathLength parameter. If a lot of files or subfolders exist in a given folder, this change should result in a small boost in execution time.

    Assuming you place the removed lines into a new method called GetMaxItemLength(), the new calling code would look something like the following.

    private void ExportFolder(string name, int maxItemLength)
    {
        // new folder created
        int folderPathLength = folder.FullName.Length + 1;

        // calculate maximum item length for files
        maxItemLength = GetMaxItemLength(folderPathLength, MAXFILELENGTH);

        // loop for each file
        fileName = GetSharePointFriendlyName(fileName, maxItemLength);

        // calculate maximum item length for subfolders
        maxItemLength = GetMaxItemLength(folderPathLength, MAXFOLDERLENGTH);

        // loop for each subfolder and recursively call this method
        ExportFolder(subfolderName, maxItemLength);
    }

  2. Thanks for the code snippet Tim. Saved me some Regex thinking which can be a real pain in the behind!

  3. Is there a utility available on the market to preview and edit file names before importing into SharePoint?

    Thanks

  4. I haven't used it, but I think you might be interested in the SharePoint SUSHI project on CodePlex.

  5. I like the post you have here. I was curious whether there is a VS solution I could use to learn how to do this. My requirement is very close to what you are doing.

  6. Anonymous, I didn't upload a code project since my file source is proprietary. The method above (and my subsequent comment) is the most important, and would simply need to be dropped into whatever collection you're iterating through.

  7. Thanks for a very helpful post. I really appreciated the background information and reasoning behind your decisions.

  8. Thanks for this helpful post...it surely saved a few hours for me...:)

  9. Beautiful post. Thanks for sharing.
    zee
    walisystems.com

  10. Hi Tim!
    How can I run this kind of program outside of the SharePoint environment? If I try to make a Windows app that runs SharePoint code, it won't work outside of the SharePoint server because of the DLLs. How can I solve it?

  11. Jony, I'm not sure I understand your question. This code is/was running outside of the SharePoint environment. It was interacting with a custom database (my custom extranet) and the local file system. Once all the files were placed in a folder, I manually uploaded them via Windows Explorer and WebDAV. Hope that helps!

  12. Thanks man, great solution, saved me some headache!
