The Good Journal #9 A Quest for Good AI

July 4, 2024

Soon, Nextcloud server v27 will become our production version, and we will gradually roll out the upgrade to every environment. It introduces some neat new features, most notably the AI features.

We’re not too fond of the general usage of the term “AI”. This is not the new Skynet, and it’s not on the way to creating terminators. We much prefer “Large Language Model” or “LLM”, because that is all it is: a big box of data with an index, able to give you a guesstimate of what you’d likely want to see next based on the input you’ve provided. At first glance, it may pass a Turing test, but it has no intelligence (see the Chinese room argument: https://plato.stanford.edu/entries/chinese-room/).

The difference in ethics between development and hosting

Nextcloud has set up a rating system for judging the ethical standards of the “AI” features and apps you can use within Nextcloud. It takes a developer’s perspective: as long as you can control the software it runs on, use the software the model is trained with, and curate the training data yourself, a feature gets a good rating.

This is sufficient effort from the developer’s point of view. When you consider hosting such things, it gets a little more complicated. If you simply offer a VM with some hardware geared towards running such software, you have to leave the selection of the models up to the admin or user. This is how most AI-as-a-service handles this problem. It leaves it up to the user to determine what models they find ethical and acceptable. If there were a clear contender for this, it would not be a problem. However, I have not found a model that clearly avoids copyright issues. Some people are working on this concept: https://huggingface.co/blog/Pclanglais/common-corpus

This proves that you do not have to accept copyright violations, a situation currently being framed as unavoidable by many “AI” companies. But this has not yet evolved into a usable model.

We do “managed hosting.” We’re involved with usage and workflow, so we share responsibility for this final step. Ideally, we would utilize a model trained with a public-domain dataset, but this work is currently incomplete and unusable.

What about the visual models?

They essentially have the same issue. At the time of writing, several companies have trained models using Creative Commons images or their own images, avoiding the copyright issue in the training data (again showing that this can be done). However, neither the model, nor the software used to train it, nor the training data is available. These models would still rate poorly under Nextcloud’s rating system, and they lack an API to connect to.

The Good, the Bad, the Ethical.

We have an additional requirement for the Ethical rating: the model used must be trained on curated data to avoid any copyright infringement issues. It’s not sufficient to technically be able to gather, select, and curate data and train your own model. As a small hosting company, we do not have an ethics or AI department, which limits our ability to curate data and train models. Additionally, the usage must stay within reasonable CPU and memory limits to avoid a large price increase.
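
As a rough illustration of how this extra requirement plays out, the pattern in the ratings further down can be sketched in a few lines of Python. This is only a sketch: the `goodcloud_rating` helper and its one-level downgrade rule are assumptions made for illustration, inferred from the ratings listed in this post. They are not a formal policy and not part of Nextcloud.

```python
# Ratings from worst to best, as used in this post.
LEVELS = ["Red", "Orange", "Yellow", "Green"]


def goodcloud_rating(nextcloud_rating: str, curated: bool) -> str:
    """Hypothetical helper: apply our extra curation requirement.

    Assumption: a feature keeps Nextcloud's rating when its training data
    is known to be curated to avoid copyright issues, and drops one level
    (never below Red) when it is not, or when we cannot confirm it.
    """
    position = LEVELS.index(nextcloud_rating)
    if curated:
        return nextcloud_rating
    return LEVELS[max(position - 1, 0)]


# Examples taken from the ratings in this post.
print(goodcloud_rating("Green", curated=True))    # Translate app
print(goodcloud_rating("Green", curated=False))   # Recognize app
print(goodcloud_rating("Yellow", curated=False))  # Whisper Speech-To-Text app
print(goodcloud_rating("Red", curated=False))     # OpenAI
```

The curation flag is the only input besides Nextcloud's own rating, which matches how we use it here: it can lower a rating, but never raise one.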

But what if we disagree?

We want to give you the information as we see it and clarify some of our decisions. We will host some “AI” features and software and not others, but if you want to connect to ChatGPT or a VPS running LocalAI, we’re happy to help you connect it up and apply the API keys to your environment.

Remember that using this can cause your data to be processed outside of your country or by a company in a different country, which can break the digital sovereignty of your data. Even if you, for instance, use a VPS to host LocalAI on Amazon or even DigitalOcean, these companies and their servers are subject to the laws of the countries in which they are housed.

Let’s focus on the practical implications. There are several areas in Nextcloud where an LLM is used. I’ve provided the rating Nextcloud has given it and the one we would give it, as well as a status for the usage.

Text Generation

LocalAI

Currently, there is no model we could host or recommend that entirely avoids using copyrighted material.

Status: Tested, can be connected on request
Nextcloud rating: Green
TheGoodCloud: Yellow

OpenAI

This is ChatGPT 4, a controversial model known for containing copyrighted material. It’s likely no surprise to anyone reading this that it ticks none of the boxes. It works rather well, but it’s not as open as you’d expect from the name.

Status: Tested, API connection can be requested.
Nextcloud rating: Red
TheGoodCloud: Red

Images

Recognize app

Image, object and face recognition.

We do not enable this app by default, and we’re currently testing its functionality. It’s heavy on resources and will likely not function properly in our smaller consumer environments unless we increase the price of these environments to dedicate more CPU and memory. The models come fully trained and do not incorporate any corrections or adjustments from Nextcloud itself. The training data for objects, faces, and actions is available, but information on any curating of the training data is lacking. The training data for the music genre recognition model is unavailable. Somehow, this still fetches a green, ethical rating from Nextcloud, but not so much from us.

Status: Testing, not ready. Creates a high load on small servers.
Nextcloud rating: Green
TheGoodCloud: Yellow (not curated to avoid copyright issues)

OpenAI

DALL-E image generation.

Status: Tested, works. External API, not open.
Nextcloud rating: Red
TheGoodCloud: Red

LocalAI

Uses a Stable Diffusion model. These models cause a lot of debate, as they are known to be trained on copyrighted images and artworks.

Status: Works, can be self-hosted and connected using an API key.
Nextcloud rating: Yellow
TheGoodCloud: Orange

Translations

Translate

This app uses the OPUS models by the University of Helsinki. It is fully open source. The data source was hard to find. However, the OPUS dataset compiles multilingual content with a free license, such as translated Wikipedia articles, to train a translation model. There is limited diversity in the languages supported.

Status: Tested and can be requested.
Nextcloud rating: Green
TheGoodCloud: Green

LibreTranslate integration

Requires a LibreTranslate server running somewhere: https://github.com/LibreTranslate/LibreTranslate

It is indeed open source but must be hosted on a separate server. I have not found a mention of how the training data was curated and collected. Until I know where that came from, it doesn’t score as high as the Translate app. If I do find it, and it is indeed curated to avoid copyright issues (which I suspect is true), we can run it in our Kubernetes cluster and offer it as a paid add-on, but the Translate app will likely suffice for most users.

Status: Testing/information incomplete.
Nextcloud rating: Green
TheGoodCloud: Yellow (needs more information)

DeepL Integration

Nothing is open source or available in the slightest. This is only for connecting the API. If you already use DeepL in your workflow, this can be handy, but if you’re looking for an ethical translation option, we’d recommend the Translate app.

Status: Connecting your DeepL account is available on request.
Nextcloud rating: Red
TheGoodCloud: Red

OpenAI

OpenAI has some very nice models and features, but none of the training data is open or actively curated to avoid copyright issues.

Status: Connection is tested. We can add the API key to the server if requested.
Nextcloud rating: Red
TheGoodCloud: Red

LocalAI

Currently, there is no model we could host or recommend that entirely avoids using copyrighted material.

Status: The connection is tested and can be connected with an API key.
Nextcloud rating: Green
TheGoodCloud: Yellow

In general:

We recommend the Translate app. It’s local and open source, its training data is available, and the current models are already created using carefully curated data.

Speech-to-text options

This is not useful for dictation but can be used to generate a transcript for a presentation, for example.

Whisper Speech-To-Text app

The software is open-source, but the training data is not available.

Status: Testing, slow
Nextcloud rating: Yellow
TheGoodCloud: Orange (training data is not available and not curated to avoid copyright issues)

Replicate app

Status: Works, external API
Nextcloud rating: Yellow
TheGoodCloud: Orange (training data not available and not curated to avoid copyright issues)

OpenAI

Status: Works, external API. We can add the API key for your OpenAI account to your environment on request.
Nextcloud rating: Yellow
TheGoodCloud: Orange (training data not available and not curated to avoid copyright issues)

LocalAI

The software to run LocalAI is open source and can be self-hosted. However, the model’s training data is not available. This requires a separate server and setup.

Status: The connection is tested and will work, but TheGoodCloud will not host LocalAI or offer it as an add-on. If requested, we will apply your API key from your own hosted instance of LocalAI to the Nextcloud server.
Nextcloud rating: Yellow
TheGoodCloud: Orange

In general:

All speech-to-text options for Nextcloud rely on OpenAI’s Whisper models, whose training data is neither available nor curated to avoid copyright issues.

Misc

Mail

It is a separate app we don’t enable by default, but it is often requested.

The model is created and trained on-premises based on the user’s own data. It prioritizes your mail. Data needs to be gathered from your usage before it can accurately anticipate your workflow. All of this is done locally, so we’re happy to enable this for you.

Status: Tested and can be requested.
Nextcloud rating: Green
TheGoodCloud: Green

Suspicious login detection

The model is created and trained locally. It helps flag login attempts that might be an issue. This is ethically fine and already enabled in most environments.

Status: Tested and shipped.
Nextcloud rating: Green
TheGoodCloud: Green

Trained locally by usage. All software is open source. (Nextcloud)

Status: Tested and can be requested.
Nextcloud rating: Green
TheGoodCloud: Green

As someone relying on accessibility software, I am very excited about the development of Large Language Models and their advancements. None of this is meant as a judgment on who is using what and why. I very much understand that some of these features can help a lot of users in a lot of ways, but let’s be honest: you would not be reading this if you were not curious about how we try to do Good while offering the “AI” features. If I have missed some information in this blog post, or if I have inadvertently misinterpreted some things, please let me know.
